You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@ignite.apache.org by Ilya Kasnacheev <il...@gmail.com> on 2019/07/01 13:45:17 UTC

Re: TcpCommunicationSpi failed to establish connection to node, node will be dropped from cluster

Hello!

This looks like genuine network connectivity issues between nodes in your
cluster. Maybe docker ports are not configured correctly?

Please consider using -Djava.net.preferIPv4Stack=true to force IPv4 between
nodes.

Regards,
-- 
Ilya Kasnacheev


пт, 28 июн. 2019 г. в 11:15, wiltu <ph...@hotmail.com>:

> Hi,
> I am using Apache Ignite 1.9.0, and made the topology with 6 nodes on VM,
> it
> is stable.
> but moved to docker latest, when run some time ago, the exception was
> occurring by one node.
> one node drop from cluster, and huge amount of TIME_WAIT connections.
> I want to know what i have to do , what i have to check.
>
> Thanks in advance.
> Regards,
> Wilson
>
>
> --------------------------
> That is iginte-config.xml.
> --------------------------
> <?xml version="1.0" encoding="UTF-8"?>
>
>
>
>
> <beans xmlns="http://www.springframework.org/schema/beans"
>         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>         xsi:schemaLocation="
>         http://www.springframework.org/schema/beans
>         http://www.springframework.org/schema/beans/spring-beans.xsd">
>         <bean id="ignite.cfg"
> class="org.apache.ignite.configuration.IgniteConfiguration">
>
>
>                 <property name="userAttributes">
>                         <map>
>                                 <entry key="ROLE_MASTER" value="worker" />
>                                 <entry key="ROLE_TRANSACTION"
> value="worker" />
>                         </map>
>                 </property>
>
>
>                 <property name="publicThreadPoolSize" value="128" />
>
>                 <property name="systemThreadPoolSize" value="64" />
>
>                 <property name="communicationSpi">
>                         <bean
> class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
>
>                                 <property name="localPort" value="48588" />
>                                 <property name="socketWriteTimeout"
> value="30000" />
>                                 <property name="messageQueueLimit"
> value="1024" />
>                         </bean>
>                 </property>
>
>                 <property name="discoverySpi">
>                         <bean
> class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
>                                 <property name="localPort" value="48505" />
>                                 <property name="localPortRange" value="2"
> />
>
>                                 <property name="networkTimeout"
> value="60000" />
>                                 <property name="ackTimeout" value="30000"
> />
>                                 <property name="socketTimeout"
> value="10000" />
>                                 <property name="reconnectCount"
> value="100" />
>
>                                 <property name="ipFinder">
>                                         <bean
>
>
> class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
>
>                                                 <property name="addresses">
>                                                         <list>
>
> <value>10.1.x.242:48505..48506</value>
>
> <value>10.1.x.243:48505..48506</value>
>
> <value>10.1.x.244:48505..48506</value>
>                                                         </list>
>                                                 </property>
>                                         </bean>
>                                 </property>
>                         </bean>
>                 </property>
>         </bean>
> </beans>
>
>
> --------------------------
> That is exception log.
> --------------------------
> [21:26:07,108][INFO][grid-timeout-worker-#55%null%][IgniteKernal]
> Metrics for local node (to disable set 'metricsLogFrequency' to 0)
>     ^-- Node [id=2f8e54aa, name=null, uptime=14:54:04:809]
>     ^-- H/N/C [hosts=3, nodes=9, CPUs=144]
>     ^-- CPU [cur=0.03%, avg=0.24%, GC=0%]
>     ^-- Heap [used=2359MB, free=90.4%, comm=16384MB]
>     ^-- Non heap [used=79MB, free=-1%, comm=82MB]
>     ^-- Public thread pool [active=0, idle=48, qSize=0]
>     ^-- System thread pool [active=0, idle=48, qSize=0]
>     ^-- Outbound messages queue [size=0]
> [21:26:39,181][WARNING][sys-stripe-43-#44%null%][TcpCommunicationSpi]
> Connect timed out (consider increasing 'failureDetectionTimeout'
> configuration property) [addr=/172.20.0.1:48588,
> failureDetectionTimeout=10000]
> [21:26:40,757][WARNING][grid-timeout-worker-#55%null%][G] >>> Possible
> starvation in striped pool: sys-stripe-43-#44%null%
> []
> deadlock: false
> completed: 12520
> Thread [name="sys-stripe-43-#44%null%", id=69, state=RUNNABLE,
> blockCnt=779,
> waitCnt=12399]
>         at sun.nio.ch.Net.poll(Native Method)
>         at sun.nio.ch.SocketChannelImpl.poll(SocketChannelImpl.java:954)
>         at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:110)
>         at
>
> o.a.i.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2822)
>         at
>
> o.a.i.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2587)
>         at
>
> o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2479)
>         at
>
> o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2340)
>         at
>
> o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2304)
>         at
> o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:1293)
>         at
> o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:1362)
>         at
>
> o.a.i.i.processors.cache.GridCacheIoManager.send(GridCacheIoManager.java:910)
>         at
>
> o.a.i.i.processors.cache.GridCacheIoManager.send(GridCacheIoManager.java:1059)
>         at
>
> o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicAbstractUpdateFuture.map(GridDhtAtomicAbstractUpdateFuture.java:358)
>         at
>
> o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1970)
>         at
>
> o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1721)
>         at
>
> o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3181)
>         at
>
> o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$1600(GridDhtAtomicCache.java:126)
>         at
>
> o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$6.apply(GridDhtAtomicCache.java:338)
>         at
>
> o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$6.apply(GridDhtAtomicCache.java:333)
>         at
>
> o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:827)
>         at
>
> o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:369)
>         at
>
> o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:293)
>         at
>
> o.a.i.i.processors.cache.GridCacheIoManager.access$000(GridCacheIoManager.java:95)
>         at
>
> o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:238)
>         at
>
> o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1222)
>         at
>
> o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:850)
>         at
>
> o.a.i.i.managers.communication.GridIoManager.access$2100(GridIoManager.java:108)
>         at
> o.a.i.i.managers.communication.GridIoManager$7.run(GridIoManager.java:790)
>         at
> o.a.i.i.util.StripedExecutor$Stripe.run(StripedExecutor.java:428)
>         at java.lang.Thread.run(Thread.java:745)
>
> [21:26:59,319][SEVERE][tcp-disco-sock-reader-#4%null%][TcpDiscoverySpi]
> Caught exception on message read
> [sock=Socket[addr=/10.1.41.242,port=49985,localport=48506],
> locNodeId=2f8e54aa-5546-4b83-8abf-f4651bf76276,
> rmtNodeId=e0da5175-9f43-427a-b8be-7e588ed78492]
> class org.apache.ignite.IgniteCheckedException: Failed to deserialize
> object
> with given class loader: sun.misc.Launcher$AppClassLoader@764c12b6
>         at
>
> org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:127)
>         at
>
> org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:94)
>         at
>
> org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:9765)
>         at
>
> org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:5790)
>         at
> org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
> Caused by: java.io.EOFException
>         at
>
> java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2353)
>         at
>
> java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2822)
>         at
> java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:804)
>         at java.io.ObjectInputStream.<init>(ObjectInputStream.java:301)
>         at
>
> org.apache.ignite.marshaller.jdk.JdkMarshallerObjectInputStream.<init>(JdkMarshallerObjectInputStream.java:39)
>         at
>
> org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:117)
>         ... 4 more
> [21:26:59,600][WARNING][sys-stripe-43-#44%null%][TcpCommunicationSpi]
> *TcpCommunicationSpi failed to establish connection to node, node will be
> dropped from cluster* [rmtNode=TcpDiscoveryNode
> [id=6871b848-3b2e-495a-a0a7-81ce466f3139, addrs=[0:0:0:0:0:0:0:1%lo,
> 10.1.41.244, 127.0.0.1, 172.17.0.1, 172.20.0.1],
> sockAddrs=[/0:0:0:0:0:0:0:1%lo:48505, /127.0.0.1:48505, /172.17.0.1:48505,
> /10.1.41.244:48505, /172.20.0.1:48505], discPort=48505, order=1,
> intOrder=1,
> lastExchangeTime=1561617110668, loc=false,
> ver=1.9.0#20170302-sha1:a8169d0a,
> isClient=false], err=class o.a.i.IgniteCheckedException: Failed to connect
> to node (is node still alive?). Make sure that each ComputeTask and cache
> Transaction has a timeout set in order to prevent parties from waiting
> forever in case of network issues
> [nodeId=6871b848-3b2e-495a-a0a7-81ce466f3139, addrs=[/172.20.0.1:48588,
> s7spkhdp04.buyabs.corp/10.1.41.244:48588, /172.17.0.1:48588,
> /0:0:0:0:0:0:0:1%lo:48588, /127.0.0.1:48588]], connectErrs=[class
> o.a.i.IgniteCheckedException: Failed to connect to address:
> /172.20.0.1:48588, class o.a.i.IgniteCheckedException: Failed to connect
> to
> address: s7spkhdp04.buyabs.corp/10.1.41.244:48588, class
> o.a.i.IgniteCheckedException: Failed to connect to address:
> /172.17.0.1:48588, class o.a.i.IgniteCheckedException: Failed to connect
> to
> address: /0:0:0:0:0:0:0:1%lo:48588, class o.a.i.IgniteCheckedException:
> Failed to connect to address: /127.0.0.1:48588]]
> [21:27:07,119][INFO][grid-timeout-worker-#55%null%][IgniteKernal]
> Metrics for local node (to disable set 'metricsLogFrequency' to 0)
>     ^-- Node [id=2f8e54aa, name=null, uptime=14:55:04:816]
>     ^-- H/N/C [hosts=3, nodes=9, CPUs=144]
>     ^-- CPU [cur=1.8%, avg=0.24%, GC=0%]
>     ^-- Heap [used=3847MB, free=84.34%, comm=16384MB]
>     ^-- Non heap [used=81MB, free=-1%, comm=83MB]
>     ^-- Public thread pool [active=0, idle=39, qSize=0]
>     ^-- System thread pool [active=0, idle=10, qSize=0]
>     ^-- Outbound messages queue [size=0]
> [21:27:10,763][WARNING][grid-timeout-worker-#55%null%][G] >>> Possible
> starvation in striped pool: sys-stripe-35-#36%null%
>
> ---------------------------------------------
> huge amount of TIME_WAIT connections
> ---------------------------------------------
>  ~]$ netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
>     114 CLOSE_WAIT
>     220 ESTABLISHED
>      49 LISTEN
>   45018 TIME_WAIT
>
>  ~]$ netstat -a
> tcp6       0      0 localhost:48588         localhost:47334
> TIME_WAIT
> tcp6       0      0 hostname:48588        hostname:44356 TIME_WAIT
> tcp6       0      0 localhost:48505         localhost:49153
> TIME_WAIT
> tcp6       0      0 hostname:48588        hostname:39694 TIME_WAIT
> tcp6       0      0 172.19.0.1:48588      hostname:36024 TIME_WAIT
> tcp6       0      0 localhost:48588         localhost:53930
> TIME_WAIT
> .....
>
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>

Re: TcpCommunicationSpi failed to establish connection to node, node will be dropped from cluster

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

Since you can use Java system properties in XMLs and you can specify those
on cmdline, should be not hard to automate DevOps here.

Glad that you have it working.

Regards,
-- 
Ilya Kasnacheev


пн, 8 июл. 2019 г. в 10:57, wiltu <ph...@hotmail.com>:

> Hello!
> Yep, "172.20.0.1" is a docker network IP, like this :
> =============================================
> 6: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue
> state
> DOWN group default
>     link/ether 02:42:56:dd:06:b5 brd ff:ff:ff:ff:ff:ff
>     inet 172.17.0.1/16 scope global docker0
>        valid_lft forever preferred_lft forever
> 11: docker_gwbridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
> noqueue state UP group default
>     link/ether 02:42:7f:c3:f0:89 brd ff:ff:ff:ff:ff:ff
>     inet 172.18.0.1/16 brd 172.18.255.255 scope global docker_gwbridge
>        valid_lft forever preferred_lft forever
>     inet6 fe80::42:7fff:fec3:f089/64 scope link
>        valid_lft forever preferred_lft forever
> =============================================
>
> My docker use host network driver for container, and this cluster can work
> for a while (maybe up 12 hours ) , the means TcpCommunicationSpi is working
> at the beginning. Maybe specify localAddress can make the cluster to work,
> but does not seem to fit the principle of auto DevOps.
> Then I tried to change the cluster to another machine, it live 2 days and
> looks good.
> I want it to run for a few days under observation.
>
> Thank you very much, I have benefited a lot from your suggestion, Please
> let
> me know if you have any suggestions.
>
> Regards,
> Wilson
>
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>

Re: TcpCommunicationSpi failed to establish connection to node, node will be dropped from cluster

Posted by wiltu <ph...@hotmail.com>.
Hello!
Yep, "172.20.0.1" is a docker network IP, like this :
=============================================
6: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state
DOWN group default
    link/ether 02:42:56:dd:06:b5 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever
11: docker_gwbridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
noqueue state UP group default
    link/ether 02:42:7f:c3:f0:89 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.1/16 brd 172.18.255.255 scope global docker_gwbridge
       valid_lft forever preferred_lft forever
    inet6 fe80::42:7fff:fec3:f089/64 scope link
       valid_lft forever preferred_lft forever
=============================================

My docker use host network driver for container, and this cluster can work
for a while (maybe up 12 hours ) , the means TcpCommunicationSpi is working
at the beginning. Maybe specify localAddress can make the cluster to work,
but does not seem to fit the principle of auto DevOps.
Then I tried to change the cluster to another machine, it live 2 days and
looks good.
I want it to run for a few days under observation.

Thank you very much, I have benefited a lot from your suggestion, Please let
me know if you have any suggestions.

Regards,
Wilson





--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: TcpCommunicationSpi failed to establish connection to node, node will be dropped from cluster

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

[11:09:05,401][WARNING][sys-stripe-5-#6%null%][TcpCommunicationSpi] Connect
timed out (consider increasing 'failureDetectionTimeout' configuration
property) [*addr=/172.20.0.1:48588 <http://172.20.0.1:48588>*,
failureDetectionTimeout=30000]
[11:13:08,565][WARNING][grid-timeout-worker-#55%null%][G] >>> Possible
starvation in striped pool: sys-stripe-16-#17%null%
[]
deadlock: false
completed: 1250
Thread [name="sys-stripe-16-#17%null%", id=42, state=RUNNABLE, blockCnt=24,
waitCnt=1243]



*        at sun.nio.ch.Net.poll(Native Method)        at
sun.nio.ch.SocketChannelImpl.poll(SocketChannelImpl.java:954)        at
sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:110)        at
o.a.i.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2822)*
        at
o.a.i.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2587)
        at
o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2479)
        at
o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2340)
        at
o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2304)
        at
o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:1293)
        at
o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:1362)
        at
o.a.i.i.processors.cache.GridCacheIoManager.send(GridCacheIoManager.java:910)
        at
o.a.i.i.processors.cache.GridCacheIoManager.send(GridCacheIoManager.java:1059)
        at
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicAbstractUpdateFuture.map(GridDhtAtomicAbstractUpdateFuture.java:358)
        at
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1970)
        at
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1721)
        at
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3181)
        at
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$1600(GridDhtAtomicCache.java:126)
        at
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$6.apply(GridDhtAtomicCache.java:338)
        at
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$6.apply(GridDhtAtomicCache.java:333)
        at
o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:827)
        at
o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:369)
        at
o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:293)
        at
o.a.i.i.processors.cache.GridCacheIoManager.access$000(GridCacheIoManager.java:95)
        at
o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:238)
        at
o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1222)
        at
o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:850)
        at
o.a.i.i.managers.communication.GridIoManager.access$2100(GridIoManager.java:108)
        at
o.a.i.i.managers.communication.GridIoManager$7.run(GridIoManager.java:790)
        at o.a.i.i.util.StripedExecutor$Stripe.run(StripedExecutor.java:428)
        at java.lang.Thread.run(Thread.java:745)

This probably means there are still connectivity problems. Isn't
*172.20.0.1* the Docker gateway or something like that? I think you should
specify localAddress on TcpCommunicationSpi of all your nodes to match
their external interface.

Regards,
-- 
Ilya Kasnacheev


ср, 3 июл. 2019 г. в 05:25, wiltu <ph...@hotmail.com>:

> Hi Ilya
> Thanks for your response. I think you are right.
> So, I tried to set "-Djava.net.preferIPv4Stack=true" and there isn't a huge
> amount of TIME_WAIT connections, but still has one node dropped from the
> cluster.
>
> ignite log :
> ignite-a228fd5c.log
> <
> http://apache-ignite-users.70518.x6.nabble.com/file/t2492/ignite-a228fd5c.log>
>
>
> Regards,
> Wilson
>
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>

Re: TcpCommunicationSpi failed to establish connection to node, node will be dropped from cluster

Posted by wiltu <ph...@hotmail.com>.
Hi Ilya
Thanks for your response. I think you are right.
So, I tried to set "-Djava.net.preferIPv4Stack=true" and there isn't a huge
amount of TIME_WAIT connections, but still has one node dropped from the
cluster.

ignite log : 
ignite-a228fd5c.log
<http://apache-ignite-users.70518.x6.nabble.com/file/t2492/ignite-a228fd5c.log>  

Regards,
Wilson





--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/