Posted to issues@ignite.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/10/25 21:43:01 UTC
[jira] [Commented] (IGNITE-6700) Node considered as failed can cause failure of others nodes
[ https://issues.apache.org/jira/browse/IGNITE-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219591#comment-16219591 ]
ASF GitHub Bot commented on IGNITE-6700:
----------------------------------------
GitHub user akuramshingg opened a pull request:
https://github.com/apache/ignite/pull/2928
IGNITE-6700 Node considered as failed can cause failure of others nodes
Added TcpDiscoverySplitTest, updated IgniteCacheTopologySplitAbstractTest
Early previous node fail with more reliable connection check (keep-alive) algorithm
Fast node failed message transmission, reduced split detection & exchange delay
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/gridgain/apache-ignite ignite-6700-new
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/ignite/pull/2928.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2928
----
commit 43e3282b278d4214b44a4b10d21a7ed3793fc64c
Author: Alexandr Kuramshin <ei...@gmail.com>
Date: 2017-10-25T21:40:03Z
IGNITE-6700 Node considered as failed can cause failure of others nodes
Added TcpDiscoverySplitTest, updated IgniteCacheTopologySplitAbstractTest
Early previous node fail with more reliable connection check (keep-alive) algorithm
Fast node failed message transmission, reduced split detection & exchange delay
----
> Node considered as failed can cause failure of others nodes
> -----------------------------------------------------------
>
> Key: IGNITE-6700
> URL: https://issues.apache.org/jira/browse/IGNITE-6700
> Project: Ignite
> Issue Type: Bug
> Security Level: Public(Viewable by anyone)
> Components: general
> Reporter: Semen Boikov
> Assignee: Semen Boikov
> Priority: Critical
>
> A node that is considered failed can cause the failure of other nodes in the cluster.
> There is an issue in TcpDiscoveryAbstractMessage.failedNodes processing: if a message is received from a node that is already considered failed, its failedNodes should be ignored.
> Possible scenario:
> - there are 4 nodes (1 -> 2 -> 3 -> 4)
> - node 3 temporarily loses its connection with the others
> - node 2 considers 3 as failed, node failed event is fired for 3
> - node 3 considers 4 as failed and adds 4 to nodeFailedList; it then restores its connection with 1, and currently 1 will process the nodeFailedList from 3 (even though 3 is already considered failed)
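
The scenario above can be sketched in a few lines of Java. This is only an illustration of the rule stated in the description (ignore failedNodes carried by a message whose sender is already considered failed). The names FailedNodesSketch, NodeView, onMessage and runScenario are hypothetical and are not part of Ignite's discovery API or of the pull request. The sketch also assumes that node 1 has already learned from node 2 that node 3 failed.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

public class FailedNodesSketch {
    /** Hypothetical stand-in for one node's local view of the ring; not Ignite's API. */
    static final class NodeView {
        /** IDs of nodes this node currently considers failed. */
        private final Set<Integer> failed = new TreeSet<>();

        /**
         * Processes the failed-node list carried by a message.
         * Returns true if the list was applied, false if it was ignored
         * because the sender itself is already considered failed.
         */
        boolean onMessage(int senderId, Collection<Integer> failedNodes) {
            if (failed.contains(senderId))
                return false; // sender is considered failed: ignore its list

            failed.addAll(failedNodes);
            return true;
        }

        Set<Integer> failed() {
            return failed;
        }
    }

    /** Replays the 4-node scenario from node 1's point of view. */
    static Set<Integer> runScenario() {
        NodeView node1 = new NodeView();

        // Assumption: node 2 tells node 1 that node 3 has failed.
        node1.onMessage(2, Collections.singletonList(3));

        // Node 3 reconnects to node 1 and reports node 4 as failed.
        // With the check in onMessage, this list is ignored.
        node1.onMessage(3, Collections.singletonList(4));

        return node1.failed();
    }

    public static void main(String[] args) {
        System.out.println(runScenario()); // prints [3]
    }
}
```

Without the sender check in onMessage, node 1 would also mark node 4 as failed, which is the behaviour the issue describes.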