Active nodes aliveness WatchDog


Active nodes aliveness WatchDog

Anton Vinogradov-2
Igniters,
Do we have a feature that checks node aliveness on a regular basis?

Scenario:
Precondition
  The cluster has no load but some node's JVM crashed.

Actual (today)
  The user performs an operation (e.g. a cache put) related to this node (via
another node) and waits for some timeout to discover that it is dead.
  The cluster starts the switch to relocate primary partitions to alive
nodes.
  Only then is the user able to retry the operation.

Desired
  Some WatchDog checks nodes aliveness on a regular basis.
  Once a failure is detected, the cluster starts the switch.
  Later, the user performs an operation on an already-recovered cluster and
waits for nothing.

It would be good news if the "Desired" case were already the actual behavior.
Can somebody point to the feature that performs this check?

Re: Active nodes aliveness WatchDog

sdarlington
This is one of the functions of the DiscoverySPI. Nodes check on their neighbours and notify the remaining nodes if one disappears. When the topology changes, it triggers a rebalance, which relocates primary partitions to live nodes. This is entirely transparent to clients.

It gets more complex… like there’s the partition loss policy and rebalancing doesn’t always happen (configurable, persistence, etc)… but broadly it does as you expect.

Regards,
Stephen


Re: Active nodes aliveness WatchDog

Anton Vinogradov-2
Stephen,

> Nodes check on their neighbours and notify the remaining nodes if one disappears.
Could you explain how this works in detail?
How can I set/change check frequency?


Re: Active nodes aliveness WatchDog

sdarlington
The configuration parameters that I’m aware of are here:

https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/spi/discovery/tcp/TcpDiscoverySpi.html

Other people would be better placed to discuss the internals.
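For illustration, the main timeouts from that page can also be set programmatically. A minimal sketch, assuming ignite-core is on the classpath; the values here are illustrative, not recommendations:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class DiscoveryTimeoutsExample {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Single knob driving failure detection between server nodes (ms).
        cfg.setFailureDetectionTimeout(10_000);

        // Separate (usually larger) timeout for client nodes (ms).
        cfg.setClientFailureDetectionTimeout(30_000);

        // Note: explicitly setting TcpDiscoverySpi's low-level timeouts
        // (socketTimeout, ackTimeout, etc.) overrides failureDetectionTimeout.
        try (Ignite ignite = Ignition.start(cfg)) {
            System.out.println("Topology size: " + ignite.cluster().nodes().size());
        }
    }
}
```

Tuning these down speeds up detection at the cost of more false positives under long GC pauses or network hiccups.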

Regards,
Stephen


Re: Active nodes aliveness WatchDog

Anton Vinogradov-2
It seems you're talking about Failure Detection (timeouts).
Will it detect a node failure on an idle cluster?


Re: Active nodes aliveness WatchDog

sdarlington
Yes. Nodes are always chatting to each other even if there are no requests coming in.

Here’s the status message: https://github.com/apache/ignite/blob/e9b3c4cebaecbeec9fa51bd6ec32a879fb89948a/modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/messages/TcpDiscoveryStatusCheckMessage.java
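To observe this from application code rather than the discovery internals, one can listen for node-failure events. A sketch, assuming ignite-core is on the classpath; EVT_NODE_FAILED must be enabled explicitly, as shown:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.events.DiscoveryEvent;
import org.apache.ignite.events.EventType;

public class NodeFailureWatcher {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();
        // Discovery events are not recorded unless explicitly enabled.
        cfg.setIncludeEventTypes(EventType.EVT_NODE_FAILED);

        Ignite ignite = Ignition.start(cfg);
        ignite.events().localListen(evt -> {
            DiscoveryEvent de = (DiscoveryEvent) evt;
            System.out.println("Node failed: " + de.eventNode().id());
            return true; // keep the listener registered
        }, EventType.EVT_NODE_FAILED);
    }
}
```

The listener fires as soon as discovery declares the node dead, even on an otherwise idle cluster.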

Regards,
Stephen


Re: Active nodes aliveness WatchDog

Steshin Vladimir
Hi everyone.

I think we should check the behavior of failure detection with tests, or find
them if already written. I'll research this question and raise a ticket
if a reproducer appears.




Re: Active nodes aliveness WatchDog

Anton Vinogradov-2
Stephen,
Thanks for the hint.

Vladimir,
Great idea! Let me know if any help is needed.
