Posted to issues@ignite.apache.org by "Pavel Tupitsyn (JIRA)" <ji...@apache.org> on 2016/08/09 12:32:32 UTC

[jira] [Updated] (IGNITE-2656) Documentation on debugging and fixing the reasons of node disconnection from the cluster

     [ https://issues.apache.org/jira/browse/IGNITE-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pavel Tupitsyn updated IGNITE-2656:
-----------------------------------
    Fix Version/s:     (was: 1.7)
                   1.8

> Documentation on debugging and fixing the reasons of node disconnection from the cluster
> ----------------------------------------------------------------------------------------
>
>                 Key: IGNITE-2656
>                 URL: https://issues.apache.org/jira/browse/IGNITE-2656
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Denis Magda
>            Assignee: Denis Magda
>            Priority: Critical
>             Fix For: 1.8
>
>
> Sometimes a node is abruptly kicked out of the cluster for some reason.
> The documentation must explain how to get to the root of the issue by looking at the log files. Usually the log of the node that was kicked out contains a "Local node segmented" message, and the log of the node that failed its next neighbor contains a more detailed message, "Failed to send message to next node".
> Next, the article must list the possible reasons for the disconnection:
> - long GC pauses (give recommendations on how to check for them);
> - high node utilization, so that the node responds with a delay;
> - network configuration parameters set too low for the environment.
> There should be a section about {{IgniteConfiguration.failureDetectionTimeout}} describing its behavior and showing all its pros and cons.
> The article must say when it makes sense to 'disable' this timeout by switching to explicit configuration of TcpDiscoverySpi.socketTimeout, TcpDiscoverySpi.ackTimeout, TcpDiscoverySpi.maxAckTimeout, and TcpDiscoverySpi.reconnectCount. The pros and cons of manual configuration have to be mentioned as well.
>   
> I would also describe the usage of TcpDiscoverySpi.joinTimeout and
> TcpDiscoverySpi.networkTimeout (used on client reconnect, when a server waits for the join result, on node stop, and for the socket reader's first message) there as well.
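
As an illustration of the log check described above, here is a minimal sketch that scans a node's log for the two messages quoted in the description. The class name, method name, and command-line usage are hypothetical and not part of Ignite; only the two message strings come from the issue text.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class DisconnectLogScan {
    // The two log messages quoted in the issue description.
    static final String SEGMENTED = "Local node segmented";
    static final String SEND_FAILED = "Failed to send message to next node";

    /** Returns every line containing either message, prefixed with its 1-based line number. */
    static List<String> findDisconnectHints(List<String> logLines) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < logLines.size(); i++) {
            String line = logLines.get(i);
            if (line.contains(SEGMENTED) || line.contains(SEND_FAILED)) {
                hits.add((i + 1) + ": " + line);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical usage: pass the path of a node's log file as the only argument.
        if (args.length == 0) {
            System.out.println("usage: java DisconnectLogScan <log-file>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        findDisconnectHints(lines).forEach(System.out::println);
    }
}
```

Running it against the logs of both the segmented node and its previous neighbor should show which side reported the failure first; the surrounding log lines then need to be read by hand.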


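For the first suspected cause in the description, long GC pauses, one rough way to check is to sample the JVM's accumulated collection time through the standard java.lang.management API. This is only a sketch: the class name, the one-second sampling interval, and the budget check are assumptions made for illustration, and accumulated collection time is an approximation of pause time, so GC logs remain the more reliable source.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcTimeProbe {
    /** Sums the accumulated collection time in milliseconds across all collectors of this JVM. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report a time
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** True if the GC time spent between two samples exceeds the given budget (hypothetical check). */
    static boolean exceedsBudget(long beforeMillis, long afterMillis, long budgetMillis) {
        return afterMillis - beforeMillis > budgetMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        long before = totalGcMillis();
        Thread.sleep(1000); // sampling interval, chosen arbitrarily for this sketch
        long after = totalGcMillis();
        System.out.println("GC time in the last interval: " + (after - before) + " ms");
    }
}
```

If the GC time within an interval approaches the configured failure detection timeout, GC pauses are a plausible reason for the node being dropped.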

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)