
Posted to dev@ignite.apache.org by Anton Vinogradov <av...@apache.org> on 2020/04/08 07:40:24 UTC

Active nodes aliveness WatchDog

Igniters,
Do we have a feature that checks nodes' aliveness on a regular basis?

Scenario:
Precondition
  The cluster has no load, but one node's JVM has crashed.

Actual (as I expect)
  The user performs an operation (e.g. a cache put) related to this node (via
another node) and waits for some timeout to learn that the node is dead.
  The cluster starts the switch to relocate primary partitions to alive
nodes.
  Now the user is able to retry the operation.

Desired
  Some WatchDog checks nodes' aliveness on a regular basis.
  Once a failure is detected, the cluster starts the switch.
  Later, the user performs an operation on an already fixed cluster and
waits for nothing.
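
To make the difference between the two cases concrete, here is a minimal arithmetic sketch. It assumes a single detection timeout and ignores the time the partition switch itself takes; the class and method names are hypothetical and are not part of Ignite.

```java
/** Toy comparison of the wait seen by the user in the two scenarios above. */
public class DetectionLatencySketch {
    /**
     * "Actual" case: nothing notices the crash until the user's operation
     * touches the dead node, so the user pays the full detection timeout.
     */
    public static long waitOnDemandMs(long detectionTimeoutMs) {
        return detectionTimeoutMs;
    }

    /**
     * "Desired" case: a background check starts timing out at crash time,
     * so the user only waits for whatever part of the timeout is still left
     * when the operation arrives (possibly nothing).
     */
    public static long waitWithWatchdogMs(long crashAtMs, long operationAtMs, long detectionTimeoutMs) {
        long detectedAtMs = crashAtMs + detectionTimeoutMs;
        return Math.max(0, detectedAtMs - operationAtMs);
    }

    public static void main(String[] args) {
        long timeoutMs = 10_000; // illustrative value only

        System.out.println("on demand:        " + waitOnDemandMs(timeoutMs));
        System.out.println("watchdog, late op: " + waitWithWatchdogMs(0, 60_000, timeoutMs));
        System.out.println("watchdog, early op: " + waitWithWatchdogMs(0, 4_000, timeoutMs));
    }
}
```

With a background check the user waits only if the operation arrives before detection has finished; on an idle cluster that window is usually long gone.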

It would be good news if the "Desired" case is already the actual behavior.
Can somebody point to the feature that performs this check?

Re: Active nodes aliveness WatchDog

Posted by Anton Vinogradov <av...@apache.org>.

Stephen,
Thanks for the hint.

Vladimir,
Great idea! Let me know if any help is needed.


Re: Active nodes aliveness WatchDog

Posted by Vladimir Steshin <vl...@gmail.com>.

Hi everyone.

I think we should check the behavior of failure detection with tests, or find
such tests if they are already written. I’ll research this question and raise a ticket
if a reproducer appears.




Re: Active nodes aliveness WatchDog

Posted by Stephen Darlington <st...@gridgain.com>.

Yes. Nodes are always chatting to one another even if there are no requests coming in.

Here’s the status message: https://github.com/apache/ignite/blob/e9b3c4cebaecbeec9fa51bd6ec32a879fb89948a/modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/messages/TcpDiscoveryStatusCheckMessage.java

Regards,
Stephen


Re: Active nodes aliveness WatchDog

Posted by Anton Vinogradov <av...@apache.org>.

It seems you're talking about Failure Detection (Timeouts).
Will it detect a node failure on an idle cluster?


Re: Active nodes aliveness WatchDog

Posted by Stephen Darlington <st...@gridgain.com>.

The configuration parameters that I’m aware of are here:

https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/spi/discovery/tcp/TcpDiscoverySpi.html

Other people would be better placed to discuss the internals.
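
For readers who want a starting point, here is a minimal sketch of how the failure detection timeouts can be set in Java. The address and the timeout values are placeholders chosen for illustration, not recommendations. As far as I understand the TcpDiscoverySpi Javadoc, the failure detection timeout is ignored if socket or acknowledgement timeouts are set explicitly on the SPI, so this sketch leaves them alone.

```java
import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

public class FailureDetectionConfigSketch {
    public static void main(String[] args) {
        // Placeholder address: replace with the discovery addresses of the real cluster.
        TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
        ipFinder.setAddresses(Collections.singletonList("127.0.0.1:47500..47509"));

        TcpDiscoverySpi discoSpi = new TcpDiscoverySpi();
        discoSpi.setIpFinder(ipFinder);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(discoSpi);

        // Illustrative values only: how long a server or client node may stay
        // silent before the rest of the cluster treats it as failed.
        cfg.setFailureDetectionTimeout(5_000);
        cfg.setClientFailureDetectionTimeout(15_000);

        try (Ignite ignite = Ignition.start(cfg)) {
            System.out.println("Failure detection timeout, ms: "
                + ignite.configuration().getFailureDetectionTimeout());
        }
    }
}
```

A lower timeout means faster detection but more false positives on long GC pauses or a slow network, so the value is a trade-off rather than a "check frequency" in the strict sense.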

Regards,
Stephen


Re: Active nodes aliveness WatchDog

Posted by Anton Vinogradov <av...@apache.org>.

Stephen,

> Nodes check on their neighbours and notify the remaining nodes if one
> disappears.
Could you explain how this works in detail?
How can I set or change the check frequency?


Re: Active nodes aliveness WatchDog

Posted by Stephen Darlington <st...@gridgain.com>.

This is one of the functions of the DiscoverySPI. Nodes check on their neighbours and notify the remaining nodes if one disappears. When the topology changes, it triggers a rebalance, which relocates primary partitions to live nodes. This is entirely transparent to clients.

It gets more complex… like there’s the partition loss policy and rebalancing doesn’t always happen (configurable, persistence, etc)… but broadly it does as you expect.
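
The neighbour-checking idea described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions (a single process, a fixed timeout, heartbeats recorded by hand); it is not Ignite's actual implementation, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of ring-based failure detection: each node watches its next neighbour. */
public class RingWatchdogSketch {
    private final List<String> ring = new ArrayList<>();        // nodes in ring order
    private final Map<String, Long> lastSeen = new HashMap<>(); // last heartbeat per node, ms
    private final long timeoutMs;

    public RingWatchdogSketch(long timeoutMs, List<String> nodes, long nowMs) {
        this.timeoutMs = timeoutMs;
        for (String node : nodes) {
            ring.add(node);
            lastSeen.put(node, nowMs);
        }
    }

    /** Records that a node answered a status check at the given time. */
    public void heartbeat(String node, long nowMs) {
        if (lastSeen.containsKey(node))
            lastSeen.put(node, nowMs);
    }

    /** Returns the node that the given node is responsible for checking. */
    public String neighbourOf(String node) {
        int idx = ring.indexOf(node);
        if (idx < 0)
            throw new IllegalArgumentException("Unknown node: " + node);
        return ring.get((idx + 1) % ring.size());
    }

    /**
     * One periodic pass: every responsive node checks its neighbour, and a
     * neighbour that has been silent for longer than the timeout is dropped
     * from the ring. Returns the nodes declared failed in this pass.
     */
    public List<String> checkOnce(long nowMs) {
        List<String> failed = new ArrayList<>();
        for (String node : new ArrayList<>(ring)) {
            if (!ring.contains(node) || ring.size() < 2)
                continue; // already removed in this pass, or nobody left to check
            if (nowMs - lastSeen.get(node) > timeoutMs)
                continue; // a silent node cannot check anyone
            String next = neighbourOf(node);
            if (nowMs - lastSeen.get(next) > timeoutMs) {
                ring.remove(next); // the "topology change" the rest of the cluster reacts to
                lastSeen.remove(next);
                failed.add(next);
            }
        }
        return failed;
    }

    /** Current ring membership, in order. */
    public List<String> topology() {
        return new ArrayList<>(ring);
    }

    public static void main(String[] args) {
        RingWatchdogSketch watchdog = new RingWatchdogSketch(1_000, Arrays.asList("A", "B", "C"), 0);
        watchdog.heartbeat("A", 1_500);
        watchdog.heartbeat("C", 1_500); // B stays silent, as if its JVM had crashed
        System.out.println("failed=" + watchdog.checkOnce(1_500) + ", topology=" + watchdog.topology());
    }
}
```

The point of the model is that detection is driven by the periodic pass, not by user requests, so a crashed node is dropped from the topology even when the cluster is idle.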

Regards,
Stephen
