Posted to user@ignite.apache.org by Prasad Bhalerao <pr...@gmail.com> on 2019/01/06 14:03:47 UTC

How to debug network issues in cluster

Hi,

I am consistently getting the "Node is out of topology" message in the logs
on node-1, while the other node, node-2, logs the message "Timed out waiting
for message delivery receipt (most probably, the reason is in long GC pauses
on remote node; consider tuning GC and increasing '"

I have checked the network bandwidth using iperf and it is 470 Mbit per
sec. I have also checked the gc logs and max pause time is 140 ms.

If it is really happening because of network issues, is there any way to
debug it?

If it were happening because of GC, I would have seen it in the GC logs.

Can someone please help me out with this?

Log messages on node-1:
2019-01-06 13:48:19,036 125016 [tcp-disco-srvr-#3%springDataNode%] INFO
o.a.i.s.d.tcp.TcpDiscoverySpi - TCP discovery accepted incoming connection
[rmtAddr=/10.114.113.65, rmtPort=35651]
2019-01-06 13:48:19,037 125017 [tcp-disco-srvr-#3%springDataNode%] INFO
o.a.i.s.d.tcp.TcpDiscoverySpi - TCP discovery spawning a new thread for
connection [rmtAddr=/10.114.113.65, rmtPort=35651]
2019-01-06 13:48:19,037 125017 [tcp-disco-sock-reader-#5%springDataNode%]
INFO  o.a.i.s.d.tcp.TcpDiscoverySpi - Started serving remote node
connection [rmtAddr=/10.114.113.65:35651, rmtPort=35651]
*2019-01-06 13:48:19,040 125020 [tcp-disco-msg-worker-#2%springDataNode%]
WARN  o.a.i.s.d.tcp.TcpDiscoverySpi - Node is out of topology (probably,
due to short-time network problems).*
2019-01-06 13:48:19,041 125021 [disco-event-worker-#62%springDataNode%]
WARN  o.a.i.i.m.d.GridDiscoveryManager - Local node SEGMENTED:
TcpDiscoveryNode [id=a5827f51-096a-4c98-af4f-564d2d3e769d,
addrs=[10.114.113.53, 127.0.0.1], sockAddrs=[/127.0.0.1:47500,
qagmscore02.p13.eng.in03.qualys.com/10.114.113.53:47500], discPort=47500,
order=2, intOrder=2, lastExchangeTime=1546782499034, loc=true,
ver=2.7.0#20181130-sha1:256ae401, isClient=false]
2019-01-06 13:48:19,041 125021 [tcp-disco-sock-reader-#5%springDataNode%]
INFO  o.a.i.s.d.tcp.TcpDiscoverySpi - Finished serving remote node
connection [rmtAddr=/10.114.113.65:35651, rmtPort=35651
2019-01-06 13:48:19,866 125846 [tcp-comm-worker-#1%springDataNode%] INFO
o.a.i.s.d.tcp.TcpDiscoverySpi - Pinging node:
cd9803ac-b810-447e-818e-ab51dada59d8

RE: How to debug network issues in cluster

Posted by Stanislav Lukyanov <st...@gmail.com>.

+1 to all points.

The message “Local node SEGMENTED” generally means that the cluster decided that the node was dead and kicked it out.
The next time the node tried to send a message to the cluster, it received an answer “you’re segmented” meaning “we’ve kicked you out, sorry”.
It usually happens when the node is unavailable for some time – due to GC, network issues, the OS/supervisor not giving the node CPU time, etc.
The primary remedy for this issue is indeed increasing failureDetectionTimeout.
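
To check for stalls that GC logs would not show (for example, the OS not scheduling the JVM), one option is to measure how late a short sleep wakes up. The sketch below is only an illustration under assumptions: PauseDetector and maxStallMillis are hypothetical names, not part of Ignite, and the code uses only the JDK.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not an Ignite API): measures how late short sleeps wake up. */
final class PauseDetector {

    /**
     * Sleeps repeatedly for intervalMs during roughly durationMs and returns the
     * largest overshoot (time actually elapsed minus time requested), in milliseconds.
     * A large value means the thread could not run: GC, CPU starvation, a frozen VM, etc.
     */
    static long maxStallMillis(long intervalMs, long durationMs) {
        long intervalNanos = TimeUnit.MILLISECONDS.toNanos(intervalMs);
        long durationNanos = TimeUnit.MILLISECONDS.toNanos(durationMs);
        long maxOvershootNanos = 0;
        long begin = System.nanoTime();
        while (System.nanoTime() - begin < durationNanos) {
            long start = System.nanoTime();
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop measuring.
                Thread.currentThread().interrupt();
                break;
            }
            long overshoot = (System.nanoTime() - start) - intervalNanos;
            maxOvershootNanos = Math.max(maxOvershootNanos, overshoot);
        }
        return TimeUnit.NANOSECONDS.toMillis(maxOvershootNanos);
    }

    public static void main(String[] args) {
        // Sample every 10 ms for 5 seconds; both values are arbitrary choices.
        long stall = maxStallMillis(10, 5_000);
        System.out.println("Largest observed stall: " + stall + " ms");
    }
}
```

To be meaningful, this would have to run on the same host as the affected node, ideally inside the same JVM. A stall that approaches the configured failureDetectionTimeout would be consistent with the segmentation described above; a consistently small value points more towards the network.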

Stan

From: Loredana Radulescu Ivanoff
Sent: January 7, 2019, 20:29
To: user@ignite.apache.org
Subject: Re: How to debug network issues in cluster

As an Ignite user, here are my two cents:

- if you were never able to get the node to join the cluster, check that there are no firewalls/rules blocking the Ignite ports (telnet might be a quick way to do that)
- check that the IPs printed by TcpDiscoverySpi are the correct ones; if you have virtual network adapters enabled then the wrong IP might be chosen, so the IP discovery will fail. This can happen if you use VirtualBox or Docker, for instance.
- for intermittent issues, you can try increasing the default failure detection timeout, which is 10s, I think. Somewhere in the Ignite doc it's recommended to use 30s if the JVM is on AWS.
- how did you configure IP discovery? In my case, I've always used static IP discovery with the shared property enabled (TcpDiscoveryVmIpFinder)

Re: How to debug network issues in cluster

Posted by Loredana Radulescu Ivanoff <lr...@tibco.com>.

As an Ignite user, here are my two cents:

- if you were never able to get the node to join the cluster, check that
there are no firewalls/rules blocking the Ignite ports (telnet might be a
quick way to do that)
- check that the IPs printed by TcpDiscoverySpi are the correct ones; if
you have virtual network adapters enabled then the wrong IP might be
chosen, so the IP discovery will fail. This can happen if you use
VirtualBox or Docker, for instance.
- for intermittent issues, you can try increasing the default failure
detection timeout, which is 10s, I think. Somewhere in the Ignite doc it's
recommended to use 30s if the JVM is on AWS.
- how did you configure IP discovery? In my case, I've always used static
IP discovery with the shared property enabled (TcpDiscoveryVmIpFinder)
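
The first two checks above can also be done from Java when telnet is not installed. This is a minimal sketch using only the JDK: NetCheck, canConnect and localAddresses are hypothetical names, not Ignite APIs, and the default port 47500 is simply the discPort value from the log in the original message.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.NetworkInterface;
import java.net.Socket;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

/** Hypothetical helper (not an Ignite API) for basic connectivity checks. */
final class NetCheck {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Refused, timed out, unresolved host, or filtered by a firewall.
            return false;
        }
    }

    /** Lists the addresses of all interfaces that are up, as "name -> address". */
    static List<String> localAddresses() {
        List<String> result = new ArrayList<>();
        try {
            Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
            if (nics == null) {
                return result;
            }
            for (NetworkInterface nic : Collections.list(nics)) {
                if (!nic.isUp()) {
                    continue;
                }
                for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                    result.add(nic.getName() + " -> " + addr.getHostAddress());
                }
            }
        } catch (SocketException e) {
            // Interface enumeration failed; return what was collected so far.
        }
        return result;
    }

    public static void main(String[] args) {
        // Usage: java NetCheck <host> <port>; the defaults are only placeholders.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 47500;
        System.out.println(host + ":" + port + " reachable: " + canConnect(host, port, 2_000));
        localAddresses().forEach(System.out::println);
    }
}
```

Running it on each node against the other node's address shows whether the discovery port is reachable in both directions. The interface list can be compared with the addrs=[...] values that TcpDiscoverySpi prints, to spot an address that belongs to a virtual adapter.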
