Posted to user@ignite.apache.org by "ilya.kasnacheev" <il...@gmail.com> on 2019/01/14 14:37:22 UTC

Re: Local node SEGMENTED error causing node goes down for no obvious reason

Hello!

I can see that at 2018-11-07T07:54:47 five new clients suddenly arrived,
and then five server nodes dropped (perhaps they were unable to acknowledge
the addition of the new nodes?).

Either there are communication problems between the new clients and the old
servers, or some bug is triggered by the rapid inclusion of multiple clients.

Regards,



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Local node SEGMENTED error causing node goes down for no obvious reason

Posted by Yakov Zhdanov <yz...@apache.org>.
It seems there were network issues. You can see plenty of discovery
warnings of this kind in the logs (231.log):

[2018-11-07T07:44:44,627][WARN ][grid-timeout-worker-#119][TcpDiscoverySpi]
Socket write has timed out (consider increasing
'IgniteConfiguration.failureDetectionTimeout' configuration property)
[failureDetectionTimeout=60000, rmtAddr=/10.29.42.232:49500, rmtPort=49500,
sockTimeout=5000]
[2018-11-07T07:44:44,630][WARN ][tcp-disco-msg-worker-#3][TcpDiscoverySpi]
Failed to send message to next node [msg=TcpDiscoveryClientReconnectMessage
[routerNodeId=9a4ee928-a71d-484b-88cc-2ded8efb7b1d, lastMsgId=null,
super=TcpDiscoveryAbstractMessage [sndNodeId=null,
id=005b10de661-88aac721-2b85-432b-b703-ca6aff5252c6,
verifierNodeId=a7685ff7-78b6-442c-a819-f8a5b2547623, topVer=0,
pendingIdx=0, failedNodes=null, isClient=false]], next=TcpDiscoveryNode
[id=e940c0d9-15f7-46ab-be95-ee2302ccc8f4, addrs=[10.29.42.232], sockAddrs=[/
10.29.42.232:49500], discPort=49500, order=2, intOrder=2,
lastExchangeTime=1541574587361, loc=false,
ver=2.6.0#20180709-sha1:5faffcee, isClient=false], errMsg=Failed to send
message to next node [msg=TcpDiscoveryClientReconnectMessage
[routerNodeId=9a4ee928-a71d-484b-88cc-2ded8efb7b1d, lastMsgId=null,
super=TcpDiscoveryAbstractMessage [sndNodeId=null,
id=005b10de661-88aac721-2b85-432b-b703-ca6aff5252c6,
verifierNodeId=a7685ff7-78b6-442c-a819-f8a5b2547623, topVer=0,
pendingIdx=0, failedNodes=null, isClient=false]], next=ClusterNode
[id=e940c0d9-15f7-46ab-be95-ee2302ccc8f4, order=2, addr=[10.29.42.232],
daemon=false]]]
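
The first warning above suggests increasing
'IgniteConfiguration.failureDetectionTimeout'. A minimal sketch of setting it
programmatically (assumes ignite-core on the classpath; 120000 ms is an
illustrative value, not a recommendation -- pick one that covers your
network's worst-case pauses):

```java
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class StartNode {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // failureDetectionTimeout bounds how long discovery tolerates an
        // unresponsive node before dropping it; the log above shows it was
        // already raised to 60000 ms with sockTimeout=5000 ms.
        cfg.setFailureDetectionTimeout(120_000L);

        Ignition.start(cfg);
    }
}
```

The same value can be set via the Spring XML bean property
'failureDetectionTimeout' if you configure nodes that way.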

The node with order=1 (the 231 one) kicked the other server nodes out of the
topology for a reason we still need to figure out.

What environment do you run Ignite on?

One more strange thing I see in the logs is this:

[2018-11-07T07:24:55,542][INFO
][tcp-disco-sock-reader-#146][TcpDiscoverySpi] Started serving remote node
connection [rmtAddr=/10.29.42.231:28977, rmtPort=28977]
[2018-11-07T07:24:55,547][INFO ][exchange-worker-#162][time] Started
exchange init [topVer=AffinityTopologyVersion [topVer=20, minorTopVer=0],
crd=true, evt=NODE_JOINED, evtNode=23c738f5-fbbf-44dc-a5fb-5d09933d9c4b,
customEvt=null, allowMerge=true]

But I do not see an "Added new node" event log entry for the node with this
ID. The message should look like this:
[2018-11-07T07:20:56,050][INFO
][disco-event-worker-#161][GridDiscoveryManager] Added new node to
topology: TcpDiscoveryNode [id=a4acc241-dfa6-44bd-a62b-ebce2f68d199,
addrs=[10.29.42.49, 127.0.0.1], sockAddrs=[/10.29.42.49:0, /127.0.0.1:0],
discPort=0, order=14, intOrder=13, lastExchangeTime=1541575169947,
loc=false, ver=2.6.0#20180710-sha1:669feacc, isClient=true]

Ilya, can you please take a look at the logs one more time?

Ray, can you please reproduce the issue with DEBUG turned on for discovery?
Please also fix all the warnings of this kind:
[2018-11-07T07:11:37,701][WARN
][disco-event-worker-#161][GridDiscoveryManager] Local node's value of
'java.net.preferIPv4Stack' system property differs from remote node's (all
nodes in topology should have identical value) [locPreferIpV4=null,
rmtPreferIpV4=null, locId8=e940c0d9, rmtId8=7e10c6c4,
rmtAddrs=[sap-zookeeper3/10.29.42.43, /127.0.0.1], rmtNode=ClusterNode
[id=7e10c6c4-3137-4640-a673-71f30d66d0e3, order=8, addr=[10.29.42.43,
127.0.0.1], daemon=false]]
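
For that warning, 'java.net.preferIPv4Stack' must be identical on every node;
the usual fix is passing the JVM flag -Djava.net.preferIPv4Stack=true to all
server and client nodes. A minimal sketch (class name is hypothetical; note
that setting the property programmatically only takes effect if it runs
before any networking classes are loaded, so the JVM flag is the safer
route):

```java
public class PreferIpv4Check {
    // Sets the property and returns its effective value, so the mismatch
    // warning above no longer fires between nodes configured this way.
    static String enableIpv4Stack() {
        System.setProperty("java.net.preferIPv4Stack", "true");
        return System.getProperty("java.net.preferIPv4Stack");
    }

    public static void main(String[] args) {
        // Must run before Ignition.start() (or anything else) touches the
        // network stack, otherwise the JVM may already have picked a stack.
        System.out.println(enableIpv4Stack());
    }
}
```

Equivalently, on every node's command line:
java -Djava.net.preferIPv4Stack=true ...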

--Yakov