Apache Ignite Developers - Legacy Mail Archive

New definition for affinity node (issues with baseline)

16 messages

Eduard Shangareev

New definition for affinity node (issues with baseline)

Hi, Igniters,

I want to raise a topic about our affinity node definition.

After baseline (affinity) topology (BL(A)T) was added, things started getting
complicated.

Plenty of bugs have appeared:

IGNITE-8173
ignite.getOrCreateCache(cacheConfig).iterator() method works incorrect for
replicated cache in case if some data node isn't in baseline

IGNITE-7628
SqlQuery hangs indefinitely with additional not registered in baseline node.

That's because everything relies on the concept of an "affinity node".
Until now it was as simple as a server node which passes the node filter,
in other words, any server node which is not filtered out by the node filter.

But a node which is not in BL(A)T and which passes the node filter is still
treated as an affinity node, and that is definitely wrong. At the very least,
it is a source of many bugs (I believe there are many more than the 2 I
have already mentioned).

It's clear that this definition should be changed.
Let's start with a new definition of "affinity topology": an affinity
topology is the set of nodes which could potentially keep data.

Given the current implementation, we can say that:
1. for in-memory cache groups it is all server nodes;
2. for persistent cache groups it is BL(A)T.

From here on I will use Dynamic Affinity Topology (DAT) for case 1 (in-memory
cache groups) and Static Affinity Topology (SAT) instead of BL(A)T for case 2.

Denote the node filter as f(X), where X is an affinity topology.

Then we can say that node A is an affinity node if
A ∈ AT', where AT' = f(AT) and AT is either DAT or SAT.

It is worth mentioning that AT' is what should be passed to the affinity
function of cache groups.
Also, AT and AT' can change over time (BL(A)T changes or nodes
join/disconnect).
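
The old and the proposed definitions can be compared in a minimal sketch. It
assumes plain strings as stand-ins for nodes and a per-node predicate as the
node filter; the class and method names are hypothetical and are not Ignite
APIs. Note that the baseline may also list nodes that are currently offline,
which this toy model does not distinguish.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

/** Illustrative only: strings stand in for nodes; none of these names are Ignite APIs. */
final class AffinityTopologySketch {
    /** Old definition: every server node that passes the node filter. */
    static Set<String> oldAffinityNodes(Set<String> serverNodes, Predicate<String> nodeFilter) {
        return filter(serverNodes, nodeFilter);
    }

    /**
     * Proposed definition: AT' = f(AT), where AT is DAT (all server nodes)
     * for in-memory cache groups and SAT (the baseline) for persistent ones.
     */
    static Set<String> affinityNodes(Set<String> serverNodes, Set<String> baseline,
        boolean persistent, Predicate<String> nodeFilter) {
        Set<String> affinityTopology = persistent ? baseline : serverNodes;

        return filter(affinityTopology, nodeFilter);
    }

    /** f(X): keeps only the nodes of X accepted by the node filter. */
    private static Set<String> filter(Set<String> nodes, Predicate<String> nodeFilter) {
        Set<String> res = new TreeSet<>();

        for (String node : nodes) {
            if (nodeFilter.test(node))
                res.add(node);
        }

        return res;
    }

    public static void main(String[] args) {
        Set<String> servers = new TreeSet<>(Arrays.asList("A", "B", "C"));
        Set<String> baseline = new TreeSet<>(Arrays.asList("A", "B")); // C joined after the baseline was set
        Predicate<String> acceptAll = node -> true;

        System.out.println(oldAffinityNodes(servers, acceptAll));               // [A, B, C]
        System.out.println(affinityNodes(servers, baseline, true, acceptAll));  // [A, B]
        System.out.println(affinityNodes(servers, baseline, false, acceptAll)); // [A, B, C]
    }
}
```

With the old definition node C counts as an affinity node of a persistent
cache group even though it is outside the baseline; with AT' = f(SAT) it does
not.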

And I don't like the fact that the choice between DAT and SAT depends on
persistence settings (should we make it configurable per cache group?).

OK, I have created a ticket to implement these changes and will start
working on it:
https://issues.apache.org/jira/browse/IGNITE-8380 (Affinity node
calculation doesn't take into account BLT).

Also, I want to use these definitions (Affinity Topology, Affinity Node,
DAT, SAT) in the documentation and Javadocs.

Maybe we should also consider replacing BL(A)T with SAT.

Thank you for your attention.

Ivan Rakov

Re: New definition for affinity node (issues with baseline)

Eduard,

Can you please summarize the code changes you are proposing?
I agree that BLT is a somewhat misleading term and that DAT/SAT make more
sense. However, reaching consensus on the v2.4 Baseline Topology terminology
took a long time, and it seems you are about to cause some more turbulence.
I still don't understand what should be changed and how. Please provide a
summary of the upcoming class renamings and changes to existing parts of the
system.

Best Regards,
Ivan Rakov

On 24.04.2018 17:46, Eduard Shangareev wrote:

Vladimir Ozerov

Re: New definition for affinity node (issues with baseline)

Guys,

As a user, I definitely do not want to think about BLATs, SATs, DATs or
anything of the sort. I want to query data, iterate over data, and send
compute tasks to data. If a certain node is outside of BLAT and does not have
data, then it is not an affinity node. Can we just fix the affinity logic to
take BLAT into account appropriately?

On Tue, Apr 24, 2018 at 6:12 PM, Ivan Rakov <[hidden email]> wrote:

Eduard Shangareev

Re: New definition for affinity node (issues with baseline)

Vladimir,

It will be fixed, but this is not the user list.

We (developers) should decide for ourselves how to go ahead with these
concepts.

And I think that our old approach to describing BLAT is overcomplicated and
unclear (maybe even error-prone).

On Tue, Apr 24, 2018 at 6:28 PM, Vladimir Ozerov <[hidden email]> wrote:

Vladimir Ozerov

Re: New definition for affinity node (issues with baseline)

Ed,

Agreed. Can we see the proposed API changes?

On Tue, Apr 24, 2018 at 6:39 PM, Eduard Shangareev <[hidden email]> wrote:

Stanislav Lukyanov

Re: New definition for affinity node (issues with baseline)

In reply to this post by Vladimir Ozerov

+1 for Vladimir's point: adding more complexity may (and likely will) be
even more misleading.

Can we take a step back and discuss why we need different behavior for
persistent and in-memory caches? Can we make in-memory caches honor the
baseline instead of special-casing them?

Thanks,
Stan

Tue, Apr 24, 2018, 18:28 Vladimir Ozerov <[hidden email]>:

Ivan Rakov

Re: New definition for affinity node (issues with baseline)

Stan,

I believe it was discussed in the design proposal thread:
http://apache-ignite-developers.2346864.n4.nabble.com/Cluster-auto-activation-design-proposal-td20295.html

The short answer: the backup factor decreases when a node leaves. In
non-persistent mode we have to rebalance data ASAP; otherwise the last node
that owns a partition may fail and the data will be lost forever.
This is not necessary if data is persisted to disk storage, and that is the
reason for the Baseline Topology concept.
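
The argument can be put into a toy model. This is a minimal sketch, assuming a
partition is described only by the set of nodes that own a copy of it; the
class and method names are hypothetical and this is not Ignite's rebalancing
logic. It also takes for granted that a persisted copy survives on disk while
its node is away.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: a toy model of partition copies, not Ignite's rebalancing logic. */
final class BackupFactorSketch {
    /** Number of copies of a partition held by nodes that are still alive. */
    static int liveCopies(Set<String> owners, Set<String> aliveNodes) {
        int copies = 0;

        for (String owner : owners) {
            if (aliveNodes.contains(owner))
                copies++;
        }

        return copies;
    }

    /**
     * Under the argument above: an in-memory partition that lost a copy must be
     * rebalanced at once, while a persistent one can wait for its owner to return.
     */
    static boolean needsImmediateRebalance(Set<String> owners, Set<String> aliveNodes, boolean persistent) {
        return !persistent && liveCopies(owners, aliveNodes) < owners.size();
    }

    public static void main(String[] args) {
        Set<String> owners = new HashSet<>(Arrays.asList("A", "B")); // primary + 1 backup
        Set<String> alive = new HashSet<>(Arrays.asList("A", "C"));  // B has left

        System.out.println(liveCopies(owners, alive));                     // 1
        System.out.println(needsImmediateRebalance(owners, alive, false)); // true
        System.out.println(needsImmediateRebalance(owners, alive, true));  // false
    }
}
```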

Best Regards,
Ivan Rakov

On 24.04.2018 18:48, Stanislav Lukyanov wrote:

Vladimir Ozerov

Re: New definition for affinity node (issues with baseline)

Ivan,

This reasoning sounds questionable to me. First, separate logic for
in-memory and persistent regions means that we lose collocation between
persistent and non-persistent caches. Second, the “data is still on disk”
assumption might not be valid if the node has left due to a disk crash, or
when data is updated on the remaining nodes.

Tue, Apr 24, 2018 at 19:21, Ivan Rakov <[hidden email]>:

Alexey Goncharuk

Re: New definition for affinity node (issues with baseline)

Vladimir,

Automatic cluster membership changes may be implemented to grow the
topology, but auto-shrinking the topology is usually not possible because a
process cannot distinguish between a node shutdown and a network
partition. If we want to deal with split-brain scenarios as a grown-up
system, we should change the replication strategy within partitions to a
consensus algorithm (I really hope we will). None of the consensus
algorithms (at least those known to me: Paxos, Raft, ZAB) do automatic
cluster adjustments based on an internally detected process failure. I
consider baseline topology a step towards this model.
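
Why a fixed membership helps with split-brain can be shown with a majority
check. This is a minimal sketch, assuming the membership is a fixed list whose
size does not shrink when nodes become unreachable; the names are hypothetical
and it is neither Ignite code nor a full consensus algorithm.

```java
/** Illustrative only: a majority check against a fixed membership size. */
final class QuorumSketch {
    /** True if the reachable members form a strict majority of the fixed membership. */
    static boolean hasMajority(int reachableMembers, int membershipSize) {
        return reachableMembers > membershipSize / 2;
    }

    public static void main(String[] args) {
        int membershipSize = 5; // fixed; not reduced when nodes become unreachable

        // A network partition splits the cluster 3 / 2: only one side may proceed.
        System.out.println(hasMajority(3, membershipSize)); // true
        System.out.println(hasMajority(2, membershipSize)); // false
    }
}
```

If each side shrank the membership to the nodes it can still see, both sides
would consider themselves complete, which is the split-brain case.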

Addressing your second concern: if a node was down for a short period of
time, we should (and we do) rebalance only deltas, which is faster than
erasing the whole node and moving all data from scratch.
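
The idea of rebalancing only deltas can be sketched as follows. The sketch
assumes every update to a partition carries a monotonically increasing counter
and that the returning node remembers the last counter it applied; the class
and its members are hypothetical and do not describe how Ignite implements
this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: delta selection by update counter, not Ignite's implementation. */
final class DeltaRebalanceSketch {
    /** A single update to a partition, tagged with its sequence counter. */
    static final class Update {
        final long counter;
        final String key;
        final String value;

        Update(long counter, String key, String value) {
            this.counter = counter;
            this.key = key;
            this.value = value;
        }
    }

    /** Updates the returning node missed: those with a counter above the last one it applied. */
    static List<Update> delta(List<Update> history, long lastAppliedCounter) {
        List<Update> missed = new ArrayList<>();

        for (Update update : history) {
            if (update.counter > lastAppliedCounter)
                missed.add(update);
        }

        return missed;
    }

    public static void main(String[] args) {
        List<Update> history = Arrays.asList(
            new Update(1, "k1", "v1"),
            new Update(2, "k2", "v2"),
            new Update(3, "k1", "v3"),
            new Update(4, "k3", "v4"),
            new Update(5, "k2", "v5"));

        // The node went down after applying update 3, so only updates 4 and 5 are shipped.
        System.out.println(delta(history, 3).size()); // 2
    }
}
```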

2018-04-24 19:42 GMT+03:00 Vladimir Ozerov <[hidden email]>:

Eduard Shangareev

Re: New definition for affinity node (issues with baseline)

Igniters,

I introduced DAT as a counterpart to BLAT (SAT) because the two terms
reflect how Ignite currently works.

However, I have doubts about whether such a separation is necessary.

DAT exists only because we don't want to lose any data in in-memory caches.

But there are alternatives. Besides BLAT auto-change policies, I would
consider the following approach:
- for in-memory caches, affinity is first calculated on SAT/BLAT, so
collocation between in-memory and persistent caches is preserved;
- as a second step, if some nodes are offline, their partitions are spread
among the alive nodes. This would save us from data loss.
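
The two steps above could be sketched roughly as follows. This is only an
illustration, not Ignite code: the class and method names are hypothetical,
node IDs are plain strings, a simple modulo stands in for a real affinity
function, and "alive nodes" is assumed to mean alive baseline nodes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: hypothetical names, not part of Ignite. */
final class TwoStepAffinitySketch {
    /** Step 1: ideal owner of each partition, computed on the baseline (SAT/BLAT) only. */
    static Map<Integer, String> idealAssignment(List<String> baseline, int parts) {
        Map<Integer, String> res = new HashMap<>();

        // Modulo is a stand-in for a real affinity function (assumption).
        for (int p = 0; p < parts; p++)
            res.put(p, baseline.get(p % baseline.size()));

        return res;
    }

    /** Step 2: partitions whose ideal owner is offline are spread among alive baseline nodes. */
    static Map<Integer, String> inMemoryAssignment(List<String> baseline, Set<String> offline, int parts) {
        List<String> alive = new ArrayList<>(baseline);
        alive.removeAll(offline);

        if (alive.isEmpty())
            throw new IllegalStateException("No alive baseline nodes");

        Map<Integer, String> res = idealAssignment(baseline, parts);
        int next = 0;

        // Only partitions of offline nodes move; all others keep their ideal owner.
        for (int p = 0; p < parts; p++) {
            if (offline.contains(res.get(p)))
                res.put(p, alive.get(next++ % alive.size()));
        }

        return res;
    }
}
```

With this scheme, partitions owned by alive nodes stay collocated with the
persistent caches; only the partitions of offline nodes are temporarily
moved.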

I don't want to propose any changes until we have consensus.

On Tue, Apr 24, 2018 at 7:55 PM, Alexey Goncharuk <
[hidden email]> wrote:


Ivan Rakov

Re: New definition for affinity node (issues with baseline)

> - for in-memory caches, affinity would calculate with SAT/BLAT on the first
> step and because of it collocation would work between in-memory and
> persistent caches;
> - on the next step, if there are offline nodes, we would spread their
> partitions among alive nodes. This would save us from data loss.
+1 to this approach.
I can't estimate how hard it is to implement, but it seems to solve
both the collocation and the data loss issues.

Best Regards,
Ivan Rakov

On 24.04.2018 20:29, Eduard Shangareev wrote:


Vladimir Ozerov

Re: New definition for affinity node (issues with baseline)

In reply to this post by Alexey Goncharuk

Alex,

CockroachDB is based on RAFT and is able to repair itself automatically [1]
[2]. Their approach looks reasonable to me and is pretty much similar to
MongoDB and Cassandra. In short, you distinguish between short-term and
long-term failures.
1) First, you wait for a small time window in the hope that it was a network
glitch or a restart. Even if this was a segmentation, with a true consensus
algorithm this is not an issue - your partitions or the whole cluster are
unavailable during this window.
2) Then, if a majority is still there and the cluster is operational, you
trigger automatic rebalance.
3) Last, if you need fine-grained control, you can tune or disable
auto-rebalance and do some manual magic.
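
As a rough illustration, the three steps above could be expressed as a small
decision policy. Everything here is hypothetical (the class, the enum, the
parameter names); it is not an Ignite or CockroachDB API. Staying in WAIT
when there is no majority is an assumption, since that case is not spelled
out above.

```java
import java.time.Duration;

/** Illustrative only: a hypothetical policy object, not a real API. */
final class AutoRepairPolicySketch {
    enum Action { WAIT, REBALANCE, MANUAL }

    private final Duration graceWindow;
    private final boolean autoRebalance;

    AutoRepairPolicySketch(Duration graceWindow, boolean autoRebalance) {
        this.graceWindow = graceWindow;
        this.autoRebalance = autoRebalance;
    }

    /** Decides what to do about a node that has been unreachable for {@code downFor}. */
    Action onNodeDown(Duration downFor, int aliveNodes, int totalNodes) {
        // Step 3: the operator disabled auto-rebalance and handles it manually.
        if (!autoRebalance)
            return Action.MANUAL;

        // Step 1: short-term failure, may be a network glitch or a restart.
        if (downFor.compareTo(graceWindow) < 0)
            return Action.WAIT;

        // Step 2: long-term failure, rebalance only if a majority is alive.
        boolean majorityAlive = 2 * aliveNodes > totalNodes;

        // Assumption: without a majority we keep waiting instead of acting.
        return majorityAlive ? Action.REBALANCE : Action.WAIT;
    }
}
```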

This is a very nice approach: it is simple for simple use cases and complex
for complex use cases. Ideally, this is how Ignite should work. Want to
play and write a hello-world app? Just learn what a cache is. Started
developing a moderately complex application? Learn about affinity, cache
modes, etc. Going to enterprise scale? Learn about BLAT, activation, etc.

It seems that the old behavior, without BLAT and even without manual
activation, would be enough for the majority of our users. At the very least,
it is enough for Cassandra and MongoDB, which are an order of magnitude more
popular.

[1]
https://www.cockroachlabs.com/docs/stable/frequently-asked-questions.html#how-does-cockroachdb-survive-failures
[2]
https://www.cockroachlabs.com/docs/stable/training/fault-tolerance-and-automated-repair.html

On Tue, Apr 24, 2018 at 7:55 PM, Alexey Goncharuk <
[hidden email]> wrote:

Alexey Goncharuk

Re: New definition for affinity node (issues with baseline)

Well, this means that the concept of baseline is still needed because we
must not reassign partitions immediately (note that this is not identical
to rebalance delay!). The approach you describe is identical to baseline
change policies and I have nothing against this; their implementation was
planned for phase II of baseline changes.

2018-04-24 21:31 GMT+03:00 Vladimir Ozerov <[hidden email]>:


Vladimir Ozerov

Re: New definition for affinity node (issues with baseline)

Right, as far as I understand we are not arguing about whether BLT is needed
or not. The main questions are how to properly deliver this feature to
users and how to deal with co-location issues between persistent and
non-persistent caches. Looks like change policies are the way to go for the
first question.

As for co-location, it is important to note that a different affinity
distribution for in-memory and persistent caches automatically means that
we lose SQL joins and predictable behavior of any affinity-based
operations. It means that if we calculated the same affinity for persistent
and in-memory caches at some point, we cannot re-distribute in-memory
caches differently when some nodes go down without breaking co-located
computations, am I right?
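
To make the concern concrete, here is a minimal sketch of a check that lists
the partitions whose owners differ between a persistent and an in-memory
cache. The class and method names are hypothetical, and assignments are
modelled as plain partition-to-node maps; this is not Ignite code.

```java
import java.util.Map;
import java.util.Objects;
import java.util.SortedSet;
import java.util.TreeSet;

/** Illustrative only: hypothetical helper, not part of Ignite. */
final class CollocationCheckSketch {
    /** Returns partitions for which the two caches are no longer co-located. */
    static SortedSet<Integer> divergedPartitions(Map<Integer, String> persistent,
                                                 Map<Integer, String> inMemory) {
        SortedSet<Integer> res = new TreeSet<>();

        for (Map.Entry<Integer, String> e : persistent.entrySet()) {
            // A partition is co-located only if both caches map it to the same node.
            if (!Objects.equals(e.getValue(), inMemory.get(e.getKey())))
                res.add(e.getKey());
        }

        return res;
    }
}
```

Any partition returned by such a check is one where a co-located join or an
affinity-based computation would no longer find both caches' data on the
same node.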

On Tue, Apr 24, 2018 at 10:19 PM, Alexey Goncharuk <
[hidden email]> wrote:

> Well, this means that the concept of baseline is still needed because we
> must not reassign partitions immediately (note that this is not identical
> to rebalance delay!). The approach you describe is identical to baseline
> change policies and I have nothing against this, their implementation was
> planned to phase II of baseline changes.
>
> 2018-04-24 21:31 GMT+03:00 Vladimir Ozerov <[hidden email]>:
>
> > Alex,
> >
> > CockroachDB is based on RAFT and is able to repair itself automatically
> [1]
> > [2]. Their approach looks reasonable to me and is pretty much similar to
> > MongoDB and Cassandra. In short, you distinguish between short-term and
> > long-term failures.
> > 1) First, you wait for small time window in hope that it was a network
> > glitch or restart. Even if this was a segmentation, with true consensus
> > algorithm this is not an issue - you partitions or the whole cluster is
> > unavailable during this window.
> > 2) Then, if majority is still there and cluster is operational you
> trigger
> > automatic rebalance.
> > 3) Last, if you need fine-grained control you can tune or disable
> > auto-rebalance and do some manual magic.
> >
> > This is very nice approach: it is simple for simple use cases and complex
> > for complex use cases. Ideally, this is how Ignite should work. Want to
> > play and write hello-world app? Just learn what cache is. Started
> > developing moderately complex application? Learn about affinity, cache
> > modes, etc.. Going to enterprise scale? Learn about BLAT, activation,
> etc..
> >
> > It seems that old behavior without BLAT and even without manual
> activation
> > would be enough for majority of our users. At the very least it is enough
> > for order of magnitude more popular Cassandra and MongoDB.
> >
> > [1]
> > https://www.cockroachlabs.com/docs/stable/frequently-asked-
> > questions.html#how-does-cockroachdb-survive-failures
> > [2]
> > https://www.cockroachlabs.com/docs/stable/training/fault-
> > tolerance-and-automated-repair.html
> >
> > On Tue, Apr 24, 2018 at 7:55 PM, Alexey Goncharuk <
> > [hidden email]> wrote:
> >
> > > Vladimir,
> > >
> > > Automatic cluster membership changes may be implemented to grow the
> > > topology, but auto-shrinking topology is usually not possible because a
> > > process cannot distinguish between a node shutdown and network
> > > partitioning. If we want to deal with split-brain scenarios as a
> grown-up
> > > system, we should change the replication strategy within partitions to
> a
> > > consensus algorithm (I really hope we will). None of the consensus
> > > algorithms (at least known to me - paxos, raft, ZAB) do auto cluster
> > > adjustments based on a internally-detected process failure. I consider
> > > baseline topology as a step towards this model.
> > >
> > > Addressing your second concern, If a node was down for a short period
> of
> > > time, we should (and we do) rebalance only deltas, which is faster than
> > > erasing the whole node and moving all data from scratch.
> > >
> > > 2018-04-24 19:42 GMT+03:00 Vladimir Ozerov <[hidden email]>:
> > >
> > > > Ivan,
> > > >
> > > > This reasoning sounds questionable to me. First, separate logic for
> > > > in-memory and persistent regions means that we lose collocation between
> > > > persistent and non-persistent caches. Second, the “data is still on disk”
> > > > assumption might not be valid if a node has left due to a disk crash, or
> > > > when data is updated on the remaining nodes.
> > > >
> > > > Tue, Apr 24, 2018 at 19:21, Ivan Rakov <[hidden email]>:
> > > >
> > > > > Stan,
> > > > >
> > > > > I believe it was discussed at the design proposal thread:
> > > > >
> > > > > http://apache-ignite-developers.2346864.n4.nabble.com/Cluster-auto-activation-design-proposal-td20295.html
> > > > >
> > > > > The short answer: the backup factor decreases if a node leaves. In
> > > > > non-persistent mode we have to rebalance data ASAP; otherwise the last
> > > > > node that owns a partition may fail and data will be lost forever.
> > > > > This is not necessary if data is persisted to disk storage, and that is
> > > > > the reason for the Baseline Topology concept.
> > > > >
> > > > > Best Regards,
> > > > > Ivan Rakov
> > > > >
> > > > > On 24.04.2018 18:48, Stanislav Lukyanov wrote:
> > > > > > + for Vladimir's point - adding more complexity may (and likely will)
> > > > > > be even more misleading.
> > > > > >
> > > > > > Can we take a step back and discuss why we need to have different
> > > > > > behavior for persistent and in-memory caches? Can we make in-memory
> > > > > > caches honor baseline instead of special-casing them?
> > > > > >
> > > > > > Thanks,
> > > > > > Stan
> > > > > >
> > > > > >
> > > > > > Tue, Apr 24, 2018, 18:28 Vladimir Ozerov <[hidden email]>:
> > > > > >
> > > > > >> Guys,
> > > > > >>
> > > > > > >> As a user I definitely do not want to think about BLATs, SATs, DATs,
> > > > > > >> whatsoever. I want to query data, iterate over data, send compute
> > > > > > >> tasks to data. If a certain node is outside of BLAT and does not have
> > > > > > >> data, then it is not an affinity node. Can we just fix the affinity
> > > > > > >> logic to take BLAT into account appropriately?
> > > > > >>
> > > > > > >> On Tue, Apr 24, 2018 at 6:12 PM, Ivan Rakov <[hidden email]> wrote:
> > > > > >>
> > > > > >>> Eduard,
> > > > > >>>
> > > > > > >>> Can you please summarize the code changes that you are proposing?
> > > > > > >>> I agree that BLT is a somewhat misleading term and that DAT/SAT make
> > > > > > >>> more sense. However, establishing a consensus on the v2.4 Baseline
> > > > > > >>> Topology terminology took a long time, and it seems you are going to
> > > > > > >>> cause a bit more perturbation.
> > > > > > >>> I still don't understand what should be changed and how. Please
> > > > > > >>> provide a summary of the upcoming class renamings and changes to
> > > > > > >>> existing system parts.
> > > > > >>>
> > > > > >>> Best Regards,
> > > > > >>> Ivan Rakov
> > > > > >>>
> > > > > >>>
> > > > > >>> On 24.04.2018 17:46, Eduard Shangareev wrote:
> > > > > >>>
> > > > > >>>> Hi, Igniters,
> > > > > >>>>
> > > > > >>>> I want to raise a topic about our affinity node definition.
> > > > > >>>>
> > > > > > >>>> After adding baseline (affinity) topology (BL(A)T), things started
> > > > > > >>>> getting complicated.
> > > > > > >>>>
> > > > > > >>>> Plenty of bugs have appeared:
> > > > > > >>>>
> > > > > > >>>> IGNITE-8173
> > > > > > >>>> ignite.getOrCreateCache(cacheConfig).iterator() method works incorrect
> > > > > > >>>> for replicated cache in case if some data node isn't in baseline
> > > > > > >>>>
> > > > > > >>>> IGNITE-7628
> > > > > > >>>> SqlQuery hangs indefinitely with additional not registered in baseline
> > > > > > >>>> node.
> > > > > > >>>>
> > > > > > >>>> It's because everything relies on the concept of an "affinity node".
> > > > > > >>>> Until now it was as simple as a server node which passes the node
> > > > > > >>>> filter. In other words, any server node which is not filtered out by
> > > > > > >>>> the node filter.
> > > > > > >>>>
> > > > > > >>>> But a node which is not in BL(A)T and which passes the node filter
> > > > > > >>>> would be treated as an affinity node, and that is definitely wrong. At
> > > > > > >>>> least, it is a source of many bugs (I believe there are many more than
> > > > > > >>>> the 2 which I have already mentioned).
> > > > > >>>>
> > > > > > >>>> It's clear that this definition should be changed.
> > > > > > >>>> Let's start with a new definition of "Affinity topology". Affinity
> > > > > > >>>> topology is a set of nodes which potentially could keep data.
> > > > > > >>>>
> > > > > > >>>> If we use knowledge about the current implementation, we can say that:
> > > > > > >>>> 1. for in-memory cache groups it would be all server nodes;
> > > > > > >>>> 2. for persistent cache groups it would be BL(A)T.
> > > > > > >>>>
> > > > > > >>>> I will further use Dynamic Affinity Topology, or DAT, for point 1
> > > > > > >>>> (in-memory cache groups) and Static Affinity Topology, or SAT, instead
> > > > > > >>>> of BL(A)T, for point 2.
> > > > > > >>>>
> > > > > > >>>> Denote node filter as f(X), where X is affinity topology.
> > > > > > >>>>
> > > > > > >>>> Then we can say that node A is an affinity node if
> > > > > > >>>> A ∈ AT', where AT' = f(AT), where AT is DAT or SAT.
> > > > > >>>>
> > > > > > >>>> It is worth mentioning that AT' should be what is passed to the
> > > > > > >>>> affinity function of cache groups.
> > > > > > >>>> Also, AT and AT' could change over time (BL(A)T changes or node
> > > > > > >>>> joins/disconnections).
> > > > > > >>>>
> > > > > > >>>> And I don't like the fact that the usage of DAT or SAT relies on
> > > > > > >>>> persistence settings (should we make it configurable per cache group?).
> > > > > > >>>>
> > > > > > >>>> OK, I have created a ticket to implement these changes and will start
> > > > > > >>>> working on it.
> > > > > > >>>> https://issues.apache.org/jira/browse/IGNITE-8380 (Affinity node
> > > > > > >>>> calculation doesn't take into account BLT).
> > > > > > >>>>
> > > > > > >>>> Also, I want to use these definitions (Affinity Topology, Affinity
> > > > > > >>>> Node, DAT, SAT) in documentation and Javadocs.
> > > > > > >>>>
> > > > > > >>>> Maybe we should also consider replacing BL(A)T with SAT.
> > > > > >>>> Thank you for your attention.
> > > > > >>>>
> > > > > >>>>
> > > > >
> > > > >
> > > >
> > >
> >
>
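
For readers following the thread, the definition quoted above (A is an affinity node if A ∈ AT', where AT' = f(AT) and AT is DAT or SAT) can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: nodes are modeled as string ids, and the class and method names (AffinityTopologySketch, affinityTopology, filtered, isAffinityNode) are hypothetical and are not Ignite APIs.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Predicate;

/** Toy model of the proposed definition; none of these names exist in Ignite. */
final class AffinityTopologySketch {
    /**
     * AT: the baseline (SAT) for persistent cache groups,
     * all server nodes (DAT) for in-memory cache groups.
     */
    static Set<String> affinityTopology(boolean persistent, Set<String> baseline, Set<String> serverNodes) {
        return new LinkedHashSet<>(persistent ? baseline : serverNodes);
    }

    /** AT' = f(AT): the subset of AT that the node filter accepts. */
    static Set<String> filtered(Set<String> at, Predicate<String> nodeFilter) {
        Set<String> result = new LinkedHashSet<>();
        for (String node : at) {
            if (nodeFilter.test(node)) {
                result.add(node);
            }
        }
        return result;
    }

    /** Node a is an affinity node if and only if a is in AT'. */
    static boolean isAffinityNode(String a, boolean persistent, Set<String> baseline,
                                  Set<String> serverNodes, Predicate<String> nodeFilter) {
        return filtered(affinityTopology(persistent, baseline, serverNodes), nodeFilter).contains(a);
    }
}
```

With a baseline of {n1, n2}, server nodes {n1, n2, n3} and a filter that accepts everything, n3 is an affinity node for an in-memory group but not for a persistent one, which is the distinction the old "server node that passes the node filter" rule misses.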

dsetrakyan

Re: New definition for affinity node (issues with baseline)

On Wed, Apr 25, 2018 at 4:13 AM, Vladimir Ozerov <[hidden email]>
wrote:

> Right, as far as I understand we are not arguing on whether BLT is needed
> or not. The main questions are how to properly deliver this feature to
> users and how to deal with co-location issues between persistent and
> non-persistent caches. Looks like change policies are the way to go for the
> first question.
>
> As for co-location, it is important to note that a different affinity
> distribution for in-memory and persistent caches automatically means that
> we lose SQL joins and predictable behavior of any affinity-based
> operations. It means that if we calculated the same affinity for persistent
> and in-memory caches at some point, we cannot re-distribute in-memory
> caches differently if some nodes go down without breaking co-located
> computations, am I right?
>

Vova, you are right, but this is rather an edge case. I doubt there are
many users out there who will need to join memory-only with persistent
caches. What you are describing would be nice to support, but I would not
make it a hard requirement. However, if we choose not to support it, we
should have a very good explanation for why not.
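
To make the co-location concern concrete, here is a minimal sketch, assuming a toy affinity function that sorts node ids and picks the primary by partition number modulo topology size. It is deliberately not Ignite's rendezvous affinity function, and CoLocationSketch, primary and coLocated are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy illustration of how different affinity topologies break co-location. */
final class CoLocationSketch {
    /** Toy affinity function: sort node ids, pick by partition modulo size. */
    static String primary(int partition, List<String> topology) {
        List<String> sorted = new ArrayList<>(topology);
        Collections.sort(sorted);
        return sorted.get(Math.floorMod(partition, sorted.size()));
    }

    /** True if both caches map the partition to the same primary node. */
    static boolean coLocated(int partition, List<String> persistentTopology,
                             List<String> inMemoryTopology) {
        return primary(partition, persistentTopology)
            .equals(primary(partition, inMemoryTopology));
    }
}
```

Suppose the baseline is [n1, n2, n3] and n3 goes down. The persistent cache keeps computing affinity over the baseline, while the in-memory cache is redistributed over [n1, n2]. In this toy model partition 2 then maps to n3 for the persistent cache and to n1 for the in-memory cache, so a join or a co-located computation on that partition no longer finds both caches' data on one node.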

Eduard Shangareev

Re: New definition for affinity node (issues with baseline)

Guys,

I have started a new topic to address the issue with DAT [1].

[1]
http://apache-ignite-developers.2346864.n4.nabble.com/IEP-4-Phase-2-Using-BL-A-T-for-in-memory-caches-td29942.html

On Tue, Apr 24, 2018 at 11:43 PM, Dmitriy Setrakyan <[hidden email]>
wrote:

> On Wed, Apr 25, 2018 at 4:13 AM, Vladimir Ozerov <[hidden email]>
> wrote:
>
> > Right, as far as I understand we are not arguing on whether BLT is needed
> > or not. The main questions are how to properly deliver this feature to
> > users and how to deal with co-location issues between persistent and
> > non-persistent caches. Looks like change policies are the way to go for
> > the first question.
> >
> > As for co-location, it is important to note that a different affinity
> > distribution for in-memory and persistent caches automatically means that
> > we lose SQL joins and predictable behavior of any affinity-based
> > operations. It means that if we calculated the same affinity for
> > persistent and in-memory caches at some point, we cannot re-distribute
> > in-memory caches differently if some nodes go down without breaking
> > co-located computations, am I right?
> >
>
> Vova, you are right, but this is rather an edge case. I doubt there are
> many users out there who will need to join memory-only with persistent
> caches. What you are describing would be nice to support, but I would not
> make it a hard requirement. However, if we choose not to support it, we
> should have a very good explanation for why not.
>