Posted to issues@ignite.apache.org by "Dmitry Karachentsev (JIRA)" <ji...@apache.org> on 2018/07/13 08:49:00 UTC

[jira] [Commented] (IGNITE-8985) Node segmented itself after connRecoveryTimeout

    [ https://issues.apache.org/jira/browse/IGNITE-8985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542723#comment-16542723 ] 

Dmitry Karachentsev commented on IGNITE-8985:
---------------------------------------------

Here are a few things that caused this behavior.
1. One node was killed.
2. The node previous to it in the ring was unable to connect and tried to go to the node next after the killed one.
3. Since the failure detection timeout is 60 secs, the connection check frequency is 60 / 3 = 20 secs. This means the previous node is treated as failed only if there was no message from it during 20 secs. On the other hand, the connection recovery timeout is only 10 secs (a sketch of this arithmetic follows the list).
4. Another issue is that each node has two loopback addresses, and one of them, 172.17.0.1:47500, is not determined to be localhost and was checked as well. In other words, the node checked a connection to itself.
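
A minimal sketch of the arithmetic in point 3 (plain Java, illustrative only and not Ignite source code; the derivation of the check interval as failureDetectionTimeout / 3 is taken from the comment above, the 10 sec recovery timeout from the warning in the logs):

    // Illustrates why the recovery window is shorter than the connection check interval.
    public class DiscoveryTimeoutMath {
        public static void main(String[] args) {
            long failureDetectionTimeout = 60_000; // reporter's setting, ms
            long connRecoveryTimeout = 10_000;     // from the warning in the logs, ms

            // Connection check frequency derived as failureDetectionTimeout / 3 (see point 3).
            long connCheckInterval = failureDetectionTimeout / 3; // 20_000 ms

            System.out.println("connCheckInterval = " + connCheckInterval + " ms");
            System.out.println("connRecoveryTimeout = " + connRecoveryTimeout + " ms");
            // A node may go up to ~20 secs between ring messages, but gets only
            // 10 secs to recover connectivity before segmenting itself.
            System.out.println("recovery window < check interval: "
                + (connRecoveryTimeout < connCheckInterval));
        }
    }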

To fix this, the loopback check from the IGNITE-8683 ticket should be applied, and IGNITE-8944 should be added to mark the node as failed faster.
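
As an interim workaround (a configuration sketch only, not the actual fix; the option of setting the recovery timeout to 0 comes from the warning message quoted below, and the 60_000 value is the reporter's failure detection timeout):

    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

    public class DiscoveryConfigSketch {
        // Builds a configuration that disables the connection recovery timeout,
        // so the node is not segmented by the recovery logic while the root
        // cause (IGNITE-8683 / IGNITE-8944) is being addressed.
        public static IgniteConfiguration config() {
            TcpDiscoverySpi discoSpi = new TcpDiscoverySpi();
            discoSpi.setConnectionRecoveryTimeout(0); // per the warning message below

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setFailureDetectionTimeout(60_000);   // reporter's failure detection timeout
            cfg.setDiscoverySpi(discoSpi);
            return cfg;
        }
    }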

> Node segmented itself after connRecoveryTimeout
> -----------------------------------------------
>
>                 Key: IGNITE-8985
>                 URL: https://issues.apache.org/jira/browse/IGNITE-8985
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mikhail Cherkasov
>            Assignee: Dmitry Karachentsev
>            Priority: Major
>         Attachments: Archive.zip
>
>
> I can see the following message in logs:
> [2018-07-10 16:27:13,111][WARN ][tcp-disco-msg-worker-#2] Unable to connect to next nodes in a ring, it seems local node is experiencing connectivity issues. Segmenting local node to avoid case when one node fails a big part of cluster. To disable that behavior set TcpDiscoverySpi.setConnectionRecoveryTimeout() to 0. [connRecoveryTimeout=10000, effectiveConnRecoveryTimeout=10000]
> [2018-07-10 16:27:13,112][WARN ][disco-event-worker-#61] Local node SEGMENTED: TcpDiscoveryNode [id=e1a19d8e-2253-458c-9757-e3372de3bef9, addrs=[127.0.0.1, 172.17.0.1, 172.25.1.17], sockAddrs=[/172.17.0.1:47500, lab17.gridgain.local/172.25.1.17:47500, /127.0.0.1:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1531229233103, loc=true, ver=2.4.7#20180710-sha1:a48ae923, isClient=false]
> I have a failure detection timeout of 60_000, and during the test GC pauses were < 25 secs, so I don't expect the node to be segmented.
>  
> Logs are attached.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)