Asynchronous registration of binary metadata

Denis Mekhanikov
Hi!

When persistence is enabled, binary metadata is written to disk upon registration. Currently this happens in the discovery thread, which makes processing of the related messages very slow.
In some cases a combination of many nodes and slow disks can make the registration of every binary type take several minutes. It also blocks processing of other discovery messages.

I propose starting a separate thread that will be responsible for writing binary metadata to disk. As a result, binary type registration will be considered finished before the information about it is written to disk on all nodes.
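
To illustrate (this is only a sketch; the class and method names are made up, not actual Ignite code): the discovery thread hands the write over to a dedicated thread and gets a future back immediately.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch only: a single dedicated thread keeps the writes ordered,
// while the discovery thread never touches the disk itself.
class BinaryMetadataWriter {
    private final ExecutorService writer =
        Executors.newSingleThreadExecutor(r -> new Thread(r, "binary-metadata-writer"));

    /** Called from the discovery thread; returns immediately. */
    CompletableFuture<Void> writeAsync(int typeId, byte[] marshalledMeta) {
        return CompletableFuture.runAsync(() -> writeToDisk(typeId, marshalledMeta), writer);
    }

    private void writeToDisk(int typeId, byte[] marshalledMeta) {
        // The actual file write (and fsync, if enabled) happens here.
    }
}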

The main concern here is data consistency in cases when a node acknowledges type registration and then fails before writing the metadata to disk.
I see two parts of this issue:
1. Nodes will have different metadata after restarting.
2. If we write some data into a persisted cache and shut down nodes faster than a new binary type is written to disk, then after a restart we won’t have a binary type to work with.

The first case is similar to a situation where one node fails and a new type is then registered in the cluster. This issue is resolved by the discovery data exchange: all nodes receive information about all binary types in the initial discovery messages sent by other nodes. So, once you restart a node, it will receive the metadata that it failed to finish writing to disk from the other nodes.
If all nodes shut down before finishing writing the metadata to disk, then after a restart the type will be considered unregistered, so another registration will be required.

The second case is a bit more complicated, but it can be resolved by making the discovery thread on every node create a future that is completed when writing to disk is finished. So, every node will have a future reflecting the current state of persisting the metadata to disk.
After that, if some operation needs this binary type, it will have to wait on that future until the flush to disk is finished.
This way the discovery threads won't be blocked, but the other threads that actually need this type will be.
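
Roughly like this (again just a sketch with made-up names, continuing the one above): the discovery thread only records the future, and the threads that actually need the type are the ones that wait.

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: per-type futures reflecting the state of the on-disk metadata.
class MetadataWriteTracker {
    private final Map<Integer, CompletableFuture<Void>> writeFuts = new ConcurrentHashMap<>();

    /** Discovery thread: remember the in-flight write, don't wait for it. */
    void onMetadataUpdate(int typeId, CompletableFuture<Void> writeFut) {
        writeFuts.put(typeId, writeFut);
        writeFut.whenComplete((res, err) -> writeFuts.remove(typeId, writeFut));
    }

    /** Cache operations that need the type block here instead of blocking discovery. */
    void awaitWritten(int typeId) {
        CompletableFuture<Void> fut = writeFuts.get(typeId);

        if (fut != null)
            fut.join();
    }
}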

Please let me know what you think about that.

Denis

Re: Asynchronous registration of binary metadata

Denis Mekhanikov
I would also like to mention that marshaller mappings are written to disk even if persistence is disabled.
So, this issue affects purely in-memory clusters as well.

Denis

Re: Asynchronous registration of binary metadata

Alexei Scherbakov
Denis Mekhanikov,

Currently metadata is fsync'ed on write. This might be the cause of
slow-downs in case of bursts of metadata writes.
I think removing fsync could help to mitigate the performance issues with
the current implementation until the proper solution is implemented: moving
metadata to the metastore.
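
To be clear about what the difference is, a plain NIO illustration (not the actual store code):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustration: with force() the write survives a power failure but may block for
// milliseconds on a slow disk; without it the data stays in the OS page cache and
// is lost only if the OS itself crashes before the background flush.
class MetadataFileWrite {
    static void write(Path file, byte[] meta, boolean fsync) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap(meta));

            if (fsync)
                ch.force(true);
        }
    }
}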


--

Best regards,
Alexei Scherbakov

Re[2]: Asynchronous registration of binary metadata

Zhenya Stanilovsky
Alexey, but in this case the customer needs to be informed that a cluster crash (power off), even of a single node, could lead to partial data unavailability,
and possibly to further index corruption.
1. Why does your metadata take up so much space? Maybe a context is leaking?
2. Could the metadata be compressed?
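
For the second question I mean something as simple as this (illustration only, plain java.util.zip; type and field names repeat a lot in metadata, so it should compress well):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

// Illustration: compress the marshalled metadata bytes before writing them to disk.
class MetaCompression {
    static byte[] compress(byte[] marshalledMeta) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();

        try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
            gzip.write(marshalledMeta);
        }

        return bos.toByteArray();
    }
}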


--
Zhenya Stanilovsky

Re: Re[2]: Asynchronous registration of binary metadata

Ivan Pavlukhin
Denis,

Several clarifying questions:
1. Do you have an idea why metadata registration takes so long? Is it
poor disks? Too much data to write? Contention with disk writes by
other subsystems?
2. Do we need persistent metadata for in-memory caches? Or is it so
by accident?

Generally, I think that it is possible to move the metadata-saving
operations out of the discovery thread without losing the required
consistency/integrity.

As Alex mentioned, using the metastore looks like a better solution. Do we
really need a fast fix here? (Are we talking about a fast fix?)

--
Best regards,
Ivan Pavlukhin

Re: Asynchronous registration of binary metadata

Denis Mekhanikov
Folks,

Thanks for showing interest in this issue!

Alexey,

> I think removing fsync could help to mitigate the performance issues with the current implementation

Is my understanding correct that if we remove fsync, then discovery won't be blocked, data will be flushed to disk in the background, and loss of information will only be possible on an OS failure? That sounds like an acceptable workaround to me.

Will moving metadata to the metastore actually resolve this issue? Please correct me if I'm wrong, but we will still need to write the information to the WAL before releasing the discovery thread. If the WAL mode is FSYNC, then the issue will still be there. Or is it planned to abandon the discovery-based protocol altogether?
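
Just so we are talking about the same thing, the mode I mean is the one set via DataStorageConfiguration; a minimal configuration sketch against the public 2.x API:

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.WALMode;

public class WalModeExample {
    public static void main(String[] args) {
        DataStorageConfiguration dsCfg = new DataStorageConfiguration()
            // FSYNC forces every WAL write to reach the physical disk before returning;
            // LOG_ONLY does not wait for the physical sync and is much cheaper.
            .setWalMode(WALMode.FSYNC);

        dsCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

        IgniteConfiguration cfg = new IgniteConfiguration().setDataStorageConfiguration(dsCfg);

        Ignition.start(cfg);
    }
}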

Evgeniy, Ivan,

In my particular case the data wasn't too big. It was a slow virtualised disk with encryption that made the operations slow. Given that there are 200 nodes in the cluster, every node writes slowly, and this process is sequential, a single piece of metadata takes extremely long to register.

Ivan, answering your other questions:

> 2. Do we need persistent metadata for in-memory caches? Or is it so by accident?

It should be checked whether it's safe to stop writing marshaller mappings to disk without losing any guarantees.
In any case, I would like to have a property that controls this. If metadata registration is slow, then the initial cluster warm-up may take a while. So, if we preserve the metadata on disk, we will need to warm it up only once, and further restarts won't be affected.
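
Something along these lines is what I mean by a property (the flag name below is made up, it doesn't exist today):

// Sketch of a hypothetical switch; IGNITE_PERSIST_MARSHALLER_MAPPINGS is not a real property.
class MappingPersistenceSwitch {
    static final boolean PERSIST_MAPPINGS =
        Boolean.parseBoolean(System.getProperty("IGNITE_PERSIST_MARSHALLER_MAPPINGS", "true"));

    static void onMappingAccepted(int typeId, String clsName) {
        if (PERSIST_MAPPINGS)
            writeMappingToDisk(typeId, clsName); // Warm-up state survives restarts.
        // Otherwise the mapping stays in memory only and is re-registered after a restart.
    }

    private static void writeMappingToDisk(int typeId, String clsName) {
        // Existing file-store logic would be called here.
    }
}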

> Do we really need a fast fix here?

I would like a fix that can be implemented now, since the work on moving metadata to the metastore doesn't sound like a quick one. Having a temporary solution would be nice.

Denis

Re: Asynchronous registration of binary metadata

Alexei Scherbakov
Denis Mekhanikov,

1. Yes, only on OS failures. In such a case the data will be received from the alive nodes later.
2. Yes, with walMode=FSYNC writes to the metastore will be slow. But that mode should not be used if you have more than two nodes in the grid, because it has a huge impact on performance.

--

Best regards,
Alexei Scherbakov

Re: Asynchronous registration of binary metadata

Denis Mekhanikov
Alexey,

I still don’t understand completely if by using metastore we are going to stop using discovery for metadata registration, or not. Could you clarify that point?
Is it going to be a distributed metastore or a local one?

Are there any relevant JIRA tickets for this change?

Denis

Re[2]: Asynchronous registration of binary metadata

Zhenya Stanilovsky

>
>> 1. Yes, only on OS failures. In such a case the data will be received from the alive nodes later.
What would the behavior be in the case of a single node? I suppose someone could end up with cache data but without the schema to unmarshal it; what would happen to grid operability in that case?

>
>> 2. Yes, with walMode=FSYNC writes to the metastore will be slow. But that mode should not be used if you have more than two nodes in the grid, because it has a huge impact on performance.
Does the WAL mode affect the metadata store?

--
Zhenya Stanilovsky

Re: Re[2]: Asynchronous registration of binary metadata

Sergey Chugunov
Denis,

Thanks for bringing this issue up; the decision to write binary metadata from
the discovery thread was really a tough one to make.
I don't think that moving metadata to the metastorage is a silver bullet here,
as this approach also has its drawbacks and is not an easy change.

In addition to the workarounds suggested by Alexei, we have two choices for
offloading the write operation from the discovery thread:

   1. Your scheme with a separate writer thread and futures completed when
   the write operation is finished.
   2. A PME-like protocol, with obvious complications such as failover and
   asynchronous waiting for replies over the communication layer.

Your suggestion looks easier from the code-complexity perspective, but in my
view it increases the chances of running into starvation. Today, if some node
faces really long delays during a write operation, it gets kicked out of the
topology by the discovery protocol. In your case it is possible that more and
more threads from other pools get stuck waiting on the operation future, which
is not good either.
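
If we go that way, the wait would probably need to be bounded, roughly like this (a sketch, names are made up), so that a stuck write surfaces as an error instead of silently piling up blocked threads:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BoundedMetadataWait {
    static void awaitWritten(CompletableFuture<Void> writeFut, long timeoutMs) {
        try {
            writeFut.get(timeoutMs, TimeUnit.MILLISECONDS);
        }
        catch (TimeoutException e) {
            // Fail loudly instead of holding the pool thread forever.
            throw new IllegalStateException("Binary metadata write took longer than " + timeoutMs + " ms", e);
        }
        catch (Exception e) {
            throw new IllegalStateException("Failed to wait for binary metadata write", e);
        }
    }
}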

What do you think?

I also think that if we want to approach this issue systematically, we need
to do a deep analysis of the metastorage option as well and finally choose
which road we want to go down.

Thanks!

Re: Asynchronous registration of binary metadata

Denis Mekhanikov
Sergey,

Currently metadata is written to disk sequentially, node by node: only one node at a time is able to write the metadata to its storage.
The slowness accumulates as you add more nodes. The delay required to write one piece of metadata may not be that big, but if you multiply it by, say, 200, it becomes noticeable.
But if we move the writing out of the discovery threads, the nodes will be doing it in parallel.
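
A back-of-the-envelope illustration (the numbers are made up, but of the right order for my case):

public class RegistrationDelay {
    public static void main(String[] args) {
        int nodes = 200;
        long writeMillis = 300;   // one metadata write on a slow encrypted disk

        // Today: the discovery message waits for the write on every node in turn.
        long sequentialMs = nodes * writeMillis;   // 60_000 ms, i.e. a minute per binary type

        // Proposed: all nodes write in the background, registration waits for none of them.
        long proposedMs = writeMillis;             // bounded by a single write

        System.out.printf("sequential: %d ms, asynchronous: ~%d ms%n", sequentialMs, proposedMs);
    }
}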

I think it's better to block some threads from a striped pool for a little while than to block discovery for the same period multiplied by the number of nodes.

What do you think?

Denis


Re: Asynchronous registration of binary metadata

Eduard Shangareev
Denis,
How would we deal with races between registration and metadata usage with such a fast fix?

I believe we need to move the metadata to the distributed metastorage and make a node await registration completeness when it can't find the metadata it needs (i.e. wait for the work in progress). Discovery shouldn't wait for anything here.
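
Roughly, I mean something like the sketch below (invented names, not the actual API): the metastorage listener completes a per-type future, and only the thread that actually needs the type waits on it.

    import java.util.Map;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.TimeUnit;

    // Sketch only: invented names, not the actual Ignite API.
    class MetastorageBackedMetadata {
        // Completed when the distributed metastorage delivers the metadata for a type.
        private final Map<Integer, CompletableFuture<byte[]>> arrived = new ConcurrentHashMap<>();

        // Metastorage listener thread: a registration has been replicated to this node.
        void onMetadataArrived(int typeId, byte[] meta) {
            arrived.computeIfAbsent(typeId, id -> new CompletableFuture<>()).complete(meta);
        }

        // User/striped-pool thread that needs the type: waits only while registration is in progress.
        byte[] metadataFor(int typeId) throws Exception {
            return arrived.computeIfAbsent(typeId, id -> new CompletableFuture<>())
                .get(30, TimeUnit.SECONDS); // the caller waits, discovery never does
        }
    }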


Re: Asynchronous registration of binary metadata

Denis Mekhanikov
Eduard,

Usages will wait for the metadata to be registered and written to disk, so no races should occur with such a flow.
Or do you have some specific case in mind?

I agree that using the distributed metastorage would be nice here.
But that way we would, in a sense, return to the previous scheme with a replicated system cache, where metadata used to be stored.
Will the metastorage-based scheme be different in any way? Won’t we decide to move back to discovery messages again after a while?

Denis




Re: Asynchronous registration of binary metadata

Alexei Scherbakov
Denis Mekhanikov,

If we are still talking about the "proper" solution, then the metastore (I mean the distributed one, of course) is the way to go.

Its contract is to store cluster-wide metadata in the most efficient way, and it can hide any optimizations for concurrent writing inside.

I'm against creating a duplicating mechanism as you suggest. We do not need more copy-pasted code.

Another possibility is to carry the metadata along with the request that needs it when it's not found locally, but this is a rather big modification.
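
Just to illustrate that second option (invented names, only a sketch): the sender attaches the metadata its payload relies on, and the receiver registers whatever it doesn't know yet before deserializing.

    import java.io.Serializable;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch only: invented names, not an actual Ignite API.
    class MetadataAwareRequest implements Serializable {
        final byte[] payload;                    // request body in binary form
        final Map<Integer, byte[]> requiredMeta; // typeId -> marshalled metadata the payload relies on

        MetadataAwareRequest(byte[] payload, Map<Integer, byte[]> requiredMeta) {
            this.payload = payload;
            this.requiredMeta = requiredMeta;
        }
    }

    class MetadataAwareReceiver {
        private final Map<Integer, byte[]> knownMeta = new ConcurrentHashMap<>();

        // Register whatever is missing locally before touching the payload,
        // so the receiver never waits for a separate registration round-trip.
        void onRequest(MetadataAwareRequest req) {
            req.requiredMeta.forEach(knownMeta::putIfAbsent); // plus an asynchronous write to disk

            // ... deserialize req.payload with the now-complete metadata and process it ...
        }
    }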




--

Best regards,
Alexei Scherbakov

Re: Asynchronous registration of binary metadata

Denis Mekhanikov
Alexey,

I’m not suggesting duplicating anything.
My point is that the proper fix will only be implemented in a relatively distant future. Why not improve the existing mechanism now instead of waiting for it?
If we don’t agree on doing this fix in master, I can do it in a fork and use it in my setup. So please let me know if you see any other drawbacks in the proposed solution.

Denis


Re: Asynchronous registration of binary metadata

Alexei Scherbakov
Denis Mekhanikov,

I think at least one node (the coordinator, for example) should still write metadata synchronously to protect against the following scenario:

tx creating new metadata is committed -> all nodes in the grid fail (powered off) -> async write to disk completes (too late)

where -> means "happens before"

All other nodes could write asynchronously, either by using a separate thread or by skipping fsync (same effect).
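
As a sketch (invented names): only the coordinator pays the synchronous fsync cost, everyone else hands the write off to a background thread.

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Sketch only: invented names.
    class MetadataWritePolicy {
        private final ExecutorService asyncWriter = Executors.newSingleThreadExecutor();

        void onMetadataUpdate(boolean coordinator, int typeId, byte[] meta) {
            if (coordinator)
                writeAndFsync(typeId, meta); // synchronous: survives a full-cluster power-off
            else
                CompletableFuture.runAsync(() -> writeAndFsync(typeId, meta), asyncWriter);
        }

        private void writeAndFsync(int typeId, byte[] meta) {
            // write to the binary metadata folder and fsync, as done today
        }
    }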



ср, 21 авг. 2019 г. в 19:48, Denis Mekhanikov <[hidden email]>:

> Alexey,
>
> I’m not suggesting to duplicate anything.
> My point is that the proper fix will be implemented in a relatively
> distant future. Why not improve the existing mechanism now instead of
> waiting for the proper fix?
> If we don’t agree on doing this fix in master, I can do it in a fork and
> use it in my setup. So please let me know if you see any other drawbacks in
> the proposed solution.
>
> Denis

--

Best regards,
Alexei Scherbakov

Re: Asynchronous registration of binary metadata

Denis Mekhanikov
Alexey,

Making only one node write metadata to disk synchronously is a possible and easy-to-implement solution, but it still has a few drawbacks:

• Discovery will still be blocked on one node. This is better than blocking all nodes one by one, but a disk write may take an indefinite amount of time, so discovery may still be affected.
• There is an unlikely but at the same time unpleasant case:
    1. The coordinator writes metadata synchronously to disk and finalizes the metadata registration. Other nodes do it asynchronously, so the actual fsync to disk may be delayed.
    2. A transaction is committed.
    3. The cluster is shut down before all nodes finish their fsync of the metadata.
    4. Nodes are started again one by one.
    5. Before the previous coordinator is started again, a read operation tries to read data that uses the metadata that wasn’t fsynced anywhere except on the coordinator, which is still not started.
    6. An error about unknown metadata is generated.

In the scheme that Sergey and I proposed, this situation isn’t possible, since cache data won’t be written until the metadata fsync is finished. Every mapped node will wait on a future until the metadata is written to disk before performing any cache changes.
What do you think about such a fix?
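
For illustration, a minimal sketch of the per-type write future described above,
using only java.util.concurrent (the names are hypothetical, not the real Ignite
classes):

    // Sketch only, invented names: the discovery thread registers a future
    // and schedules the write; threads that need the type wait on the future.
    import java.util.Map;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class BinaryMetadataWriterSketch {
        private final Map<Integer, CompletableFuture<Void>> writeFuts =
            new ConcurrentHashMap<>();

        private final ExecutorService writer = Executors.newSingleThreadExecutor();

        /** Called from the discovery thread: no disk I/O happens here. */
        public void onMetadataRegistered(int typeId, byte[] meta) {
            CompletableFuture<Void> fut = new CompletableFuture<>();
            writeFuts.put(typeId, fut);

            writer.submit(() -> {
                try {
                    writeAndFsync(typeId, meta); // the slow part runs here
                    fut.complete(null);
                }
                catch (Throwable e) {
                    fut.completeExceptionally(e);
                }
            });
        }

        /** Called from striped pool / user threads before the type is used. */
        public void awaitMetadataWritten(int typeId) {
            CompletableFuture<Void> fut = writeFuts.get(typeId);

            if (fut != null)
                fut.join(); // blocks only operations that need this type
        }

        private void writeAndFsync(int typeId, byte[] meta) {
            // Placeholder for the actual file write + fsync.
        }
    }

The discovery thread only creates the future and schedules the task, so it never
blocks on I/O; only the threads that actually need the new type wait.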

Denis
On 22 Aug 2019, 12:44 +0300, Alexei Scherbakov <[hidden email]>, wrote:

> Denis Mekhanikov,
>
> I think at least one node (coordinator for example) still should write
> metadata synchronously to protect from a scenario:
>
> tx creating new metadata is commited <- all nodes in grid are failed
> (powered off) <- async writing to disk is completed
>
> where <- means "happens before"
>
> All other nodes could write asynchronously, by using separate thread or not
> doing fsync( same effect)

Re: Asynchronous registration of binary metadata

Alexei Scherbakov
Do I understand correctly that only the affected requests, the ones touching "dirty"
metadata, will be delayed, but not all of them?
Doesn't this check hurt performance? Otherwise ALL requests will be blocked
until some unrelated metadata is written, which is highly undesirable.

Otherwise it looks good, as long as performance is not affected by the implementation.


чт, 22 авг. 2019 г. в 15:18, Denis Mekhanikov <[hidden email]>:

> Alexey,
>
> Making only one node write metadata to disk synchronously is a possible
> and easy to implement solution, but it still has a few drawbacks:
>
> • Discovery will still be blocked on one node. This is better than
> blocking all nodes one by one, but disk write may take indefinite time, so
> discovery may still be affected.
> • There is an unlikely but at the same time an unpleasant case:
>     1. A coordinator writes metadata synchronously to disk and finalizes
> the metadata registration. Other nodes do it asynchronously, so actual
> fsync to a disk may be delayed.
>     2. A transaction is committed.
>     3. The cluster is shut down before all nodes finish their fsync of
> metadata.
>     4. Nodes are started again one by one.
>     5. Before the previous coordinator is started again, a read operation
> tries to read the data, that uses the metadata that wasn’t fsynced anywhere
> except the coordinator, which is still not started.
>     6. Error about unknown metadata is generated.
>
> In the scheme, that Sergey and me proposed, this situation isn’t possible,
> since the data won’t be written to disk until fsync is finished. Every
> mapped node will wait on a future until metadata is written to disk before
> performing any cache changes.
> What do you think about such fix?
>
> Denis

--

Best regards,
Alexei Scherbakov

Re: Asynchronous registration of binary metadata

Sergey Chugunov
Alexei, if my understanding is correct (Denis, please correct me if I'm
wrong), we'll indeed delay only the requests that touch "dirty" metadata (metadata
with an unfinished write to disk).

I don't expect a significant performance impact here, because we already don't
allow other threads to use "dirty" metadata and declare it "clean"
only when it is fully written to disk.

As far as I can see, the only source of performance degradation here would
be the additional hand-off of "write metadata" tasks between the discovery thread
and the "writer" thread. But this should be minor compared to I/O operations.

Reply | Threaded
Open this post in threaded view
|

Re: Asynchronous registration of binary metadata

Denis Mekhanikov
Sergey,

Yes, your understanding matches mine.

I created a JIRA ticket for this change: https://issues.apache.org/jira/browse/IGNITE-12099

Denis