You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@ignite.apache.org by Ilya Kasnacheev <il...@gmail.com> on 2019/07/01 13:45:17 UTC
Re: TcpCommunicationSpi failed to establish connection to node, node
will be dropped from cluster
Hello!
This looks like genuine network connectivity issues between nodes in your
cluster. Maybe docker ports are not configured correctly?
Please consider using -Djava.net.preferIPv4Stack=true to force IPv4 between
nodes.
Regards,
--
Ilya Kasnacheev
пт, 28 июн. 2019 г. в 11:15, wiltu <ph...@hotmail.com>:
> Hi,
> I am using Apache Ignite 1.9.0, and made the topology with 6 nodes on VM,
> it
> is stable.
> but moved to docker latest, when run some time ago, the exception was
> occurring by one node.
> one node drop from cluster, and huge amount of TIME_WAIT connections.
> I want to know what i have to do , what i have to check.
>
> Thanks in advance.
> Regards,
> Wilson
>
>
> --------------------------
> That is iginte-config.xml.
> --------------------------
> <?xml version="1.0" encoding="UTF-8"?>
>
>
>
>
> <beans xmlns="http://www.springframework.org/schema/beans"
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> xsi:schemaLocation="
> http://www.springframework.org/schema/beans
> http://www.springframework.org/schema/beans/spring-beans.xsd">
> <bean id="ignite.cfg"
> class="org.apache.ignite.configuration.IgniteConfiguration">
>
>
> <property name="userAttributes">
> <map>
> <entry key="ROLE_MASTER" value="worker" />
> <entry key="ROLE_TRANSACTION"
> value="worker" />
> </map>
> </property>
>
>
> <property name="publicThreadPoolSize" value="128" />
>
> <property name="systemThreadPoolSize" value="64" />
>
> <property name="communicationSpi">
> <bean
> class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
>
> <property name="localPort" value="48588" />
> <property name="socketWriteTimeout"
> value="30000" />
> <property name="messageQueueLimit"
> value="1024" />
> </bean>
> </property>
>
> <property name="discoverySpi">
> <bean
> class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
> <property name="localPort" value="48505" />
> <property name="localPortRange" value="2"
> />
>
> <property name="networkTimeout"
> value="60000" />
> <property name="ackTimeout" value="30000"
> />
> <property name="socketTimeout"
> value="10000" />
> <property name="reconnectCount"
> value="100" />
>
> <property name="ipFinder">
> <bean
>
>
> class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
>
> <property name="addresses">
> <list>
>
> <value>10.1.x.242:48505..48506</value>
>
> <value>10.1.x.243:48505..48506</value>
>
> <value>10.1.x.244:48505..48506</value>
> </list>
> </property>
> </bean>
> </property>
> </bean>
> </property>
> </bean>
> </beans>
>
>
> --------------------------
> That is exception log.
> --------------------------
> [21:26:07,108][INFO][grid-timeout-worker-#55%null%][IgniteKernal]
> Metrics for local node (to disable set 'metricsLogFrequency' to 0)
> ^-- Node [id=2f8e54aa, name=null, uptime=14:54:04:809]
> ^-- H/N/C [hosts=3, nodes=9, CPUs=144]
> ^-- CPU [cur=0.03%, avg=0.24%, GC=0%]
> ^-- Heap [used=2359MB, free=90.4%, comm=16384MB]
> ^-- Non heap [used=79MB, free=-1%, comm=82MB]
> ^-- Public thread pool [active=0, idle=48, qSize=0]
> ^-- System thread pool [active=0, idle=48, qSize=0]
> ^-- Outbound messages queue [size=0]
> [21:26:39,181][WARNING][sys-stripe-43-#44%null%][TcpCommunicationSpi]
> Connect timed out (consider increasing 'failureDetectionTimeout'
> configuration property) [addr=/172.20.0.1:48588,
> failureDetectionTimeout=10000]
> [21:26:40,757][WARNING][grid-timeout-worker-#55%null%][G] >>> Possible
> starvation in striped pool: sys-stripe-43-#44%null%
> []
> deadlock: false
> completed: 12520
> Thread [name="sys-stripe-43-#44%null%", id=69, state=RUNNABLE,
> blockCnt=779,
> waitCnt=12399]
> at sun.nio.ch.Net.poll(Native Method)
> at sun.nio.ch.SocketChannelImpl.poll(SocketChannelImpl.java:954)
> at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:110)
> at
>
> o.a.i.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2822)
> at
>
> o.a.i.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2587)
> at
>
> o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2479)
> at
>
> o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2340)
> at
>
> o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2304)
> at
> o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:1293)
> at
> o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:1362)
> at
>
> o.a.i.i.processors.cache.GridCacheIoManager.send(GridCacheIoManager.java:910)
> at
>
> o.a.i.i.processors.cache.GridCacheIoManager.send(GridCacheIoManager.java:1059)
> at
>
> o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicAbstractUpdateFuture.map(GridDhtAtomicAbstractUpdateFuture.java:358)
> at
>
> o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1970)
> at
>
> o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1721)
> at
>
> o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3181)
> at
>
> o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$1600(GridDhtAtomicCache.java:126)
> at
>
> o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$6.apply(GridDhtAtomicCache.java:338)
> at
>
> o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$6.apply(GridDhtAtomicCache.java:333)
> at
>
> o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:827)
> at
>
> o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:369)
> at
>
> o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:293)
> at
>
> o.a.i.i.processors.cache.GridCacheIoManager.access$000(GridCacheIoManager.java:95)
> at
>
> o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:238)
> at
>
> o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1222)
> at
>
> o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:850)
> at
>
> o.a.i.i.managers.communication.GridIoManager.access$2100(GridIoManager.java:108)
> at
> o.a.i.i.managers.communication.GridIoManager$7.run(GridIoManager.java:790)
> at
> o.a.i.i.util.StripedExecutor$Stripe.run(StripedExecutor.java:428)
> at java.lang.Thread.run(Thread.java:745)
>
> [21:26:59,319][SEVERE][tcp-disco-sock-reader-#4%null%][TcpDiscoverySpi]
> Caught exception on message read
> [sock=Socket[addr=/10.1.41.242,port=49985,localport=48506],
> locNodeId=2f8e54aa-5546-4b83-8abf-f4651bf76276,
> rmtNodeId=e0da5175-9f43-427a-b8be-7e588ed78492]
> class org.apache.ignite.IgniteCheckedException: Failed to deserialize
> object
> with given class loader: sun.misc.Launcher$AppClassLoader@764c12b6
> at
>
> org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:127)
> at
>
> org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:94)
> at
>
> org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:9765)
> at
>
> org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:5790)
> at
> org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
> Caused by: java.io.EOFException
> at
>
> java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2353)
> at
>
> java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2822)
> at
> java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:804)
> at java.io.ObjectInputStream.<init>(ObjectInputStream.java:301)
> at
>
> org.apache.ignite.marshaller.jdk.JdkMarshallerObjectInputStream.<init>(JdkMarshallerObjectInputStream.java:39)
> at
>
> org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:117)
> ... 4 more
> [21:26:59,600][WARNING][sys-stripe-43-#44%null%][TcpCommunicationSpi]
> *TcpCommunicationSpi failed to establish connection to node, node will be
> dropped from cluster* [rmtNode=TcpDiscoveryNode
> [id=6871b848-3b2e-495a-a0a7-81ce466f3139, addrs=[0:0:0:0:0:0:0:1%lo,
> 10.1.41.244, 127.0.0.1, 172.17.0.1, 172.20.0.1],
> sockAddrs=[/0:0:0:0:0:0:0:1%lo:48505, /127.0.0.1:48505, /172.17.0.1:48505,
> /10.1.41.244:48505, /172.20.0.1:48505], discPort=48505, order=1,
> intOrder=1,
> lastExchangeTime=1561617110668, loc=false,
> ver=1.9.0#20170302-sha1:a8169d0a,
> isClient=false], err=class o.a.i.IgniteCheckedException: Failed to connect
> to node (is node still alive?). Make sure that each ComputeTask and cache
> Transaction has a timeout set in order to prevent parties from waiting
> forever in case of network issues
> [nodeId=6871b848-3b2e-495a-a0a7-81ce466f3139, addrs=[/172.20.0.1:48588,
> s7spkhdp04.buyabs.corp/10.1.41.244:48588, /172.17.0.1:48588,
> /0:0:0:0:0:0:0:1%lo:48588, /127.0.0.1:48588]], connectErrs=[class
> o.a.i.IgniteCheckedException: Failed to connect to address:
> /172.20.0.1:48588, class o.a.i.IgniteCheckedException: Failed to connect
> to
> address: s7spkhdp04.buyabs.corp/10.1.41.244:48588, class
> o.a.i.IgniteCheckedException: Failed to connect to address:
> /172.17.0.1:48588, class o.a.i.IgniteCheckedException: Failed to connect
> to
> address: /0:0:0:0:0:0:0:1%lo:48588, class o.a.i.IgniteCheckedException:
> Failed to connect to address: /127.0.0.1:48588]]
> [21:27:07,119][INFO][grid-timeout-worker-#55%null%][IgniteKernal]
> Metrics for local node (to disable set 'metricsLogFrequency' to 0)
> ^-- Node [id=2f8e54aa, name=null, uptime=14:55:04:816]
> ^-- H/N/C [hosts=3, nodes=9, CPUs=144]
> ^-- CPU [cur=1.8%, avg=0.24%, GC=0%]
> ^-- Heap [used=3847MB, free=84.34%, comm=16384MB]
> ^-- Non heap [used=81MB, free=-1%, comm=83MB]
> ^-- Public thread pool [active=0, idle=39, qSize=0]
> ^-- System thread pool [active=0, idle=10, qSize=0]
> ^-- Outbound messages queue [size=0]
> [21:27:10,763][WARNING][grid-timeout-worker-#55%null%][G] >>> Possible
> starvation in striped pool: sys-stripe-35-#36%null%
>
> ---------------------------------------------
> huge amount of TIME_WAIT connections
> ---------------------------------------------
> ~]$ netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
> 114 CLOSE_WAIT
> 220 ESTABLISHED
> 49 LISTEN
> 45018 TIME_WAIT
>
> ~]$ netstat -a
> tcp6 0 0 localhost:48588 localhost:47334
> TIME_WAIT
> tcp6 0 0 hostname:48588 hostname:44356 TIME_WAIT
> tcp6 0 0 localhost:48505 localhost:49153
> TIME_WAIT
> tcp6 0 0 hostname:48588 hostname:39694 TIME_WAIT
> tcp6 0 0 172.19.0.1:48588 hostname:36024 TIME_WAIT
> tcp6 0 0 localhost:48588 localhost:53930
> TIME_WAIT
> .....
>
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>
Re: TcpCommunicationSpi failed to establish connection to node, node
will be dropped from cluster
Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!
Since you can use Java system properties in XMLs and you can specify those
on cmdline, should be not hard to automate DevOps here.
Glad that you have it working.
Regards,
--
Ilya Kasnacheev
пн, 8 июл. 2019 г. в 10:57, wiltu <ph...@hotmail.com>:
> Hello!
> Yep, "172.20.0.1" is a docker network IP, like this :
> =============================================
> 6: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue
> state
> DOWN group default
> link/ether 02:42:56:dd:06:b5 brd ff:ff:ff:ff:ff:ff
> inet 172.17.0.1/16 scope global docker0
> valid_lft forever preferred_lft forever
> 11: docker_gwbridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
> noqueue state UP group default
> link/ether 02:42:7f:c3:f0:89 brd ff:ff:ff:ff:ff:ff
> inet 172.18.0.1/16 brd 172.18.255.255 scope global docker_gwbridge
> valid_lft forever preferred_lft forever
> inet6 fe80::42:7fff:fec3:f089/64 scope link
> valid_lft forever preferred_lft forever
> =============================================
>
> My docker use host network driver for container, and this cluster can work
> for a while (maybe up 12 hours ) , the means TcpCommunicationSpi is working
> at the beginning. Maybe specify localAddress can make the cluster to work,
> but does not seem to fit the principle of auto DevOps.
> Then I tried to change the cluster to another machine, it live 2 days and
> looks good.
> I want it to run for a few days under observation.
>
> Thank you very much, I have benefited a lot from your suggestion, Please
> let
> me know if you have any suggestions.
>
> Regards,
> Wilson
>
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>
Re: TcpCommunicationSpi failed to establish connection to node,
node will be dropped from cluster
Posted by wiltu <ph...@hotmail.com>.
Hello!
Yep, "172.20.0.1" is a docker network IP, like this :
=============================================
6: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state
DOWN group default
link/ether 02:42:56:dd:06:b5 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 scope global docker0
valid_lft forever preferred_lft forever
11: docker_gwbridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
noqueue state UP group default
link/ether 02:42:7f:c3:f0:89 brd ff:ff:ff:ff:ff:ff
inet 172.18.0.1/16 brd 172.18.255.255 scope global docker_gwbridge
valid_lft forever preferred_lft forever
inet6 fe80::42:7fff:fec3:f089/64 scope link
valid_lft forever preferred_lft forever
=============================================
My docker use host network driver for container, and this cluster can work
for a while (maybe up 12 hours ) , the means TcpCommunicationSpi is working
at the beginning. Maybe specify localAddress can make the cluster to work,
but does not seem to fit the principle of auto DevOps.
Then I tried to change the cluster to another machine, it live 2 days and
looks good.
I want it to run for a few days under observation.
Thank you very much, I have benefited a lot from your suggestion, Please let
me know if you have any suggestions.
Regards,
Wilson
--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
Re: TcpCommunicationSpi failed to establish connection to node, node
will be dropped from cluster
Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!
[11:09:05,401][WARNING][sys-stripe-5-#6%null%][TcpCommunicationSpi] Connect
timed out (consider increasing 'failureDetectionTimeout' configuration
property) [*addr=/172.20.0.1:48588 <http://172.20.0.1:48588>*,
failureDetectionTimeout=30000]
[11:13:08,565][WARNING][grid-timeout-worker-#55%null%][G] >>> Possible
starvation in striped pool: sys-stripe-16-#17%null%
[]
deadlock: false
completed: 1250
Thread [name="sys-stripe-16-#17%null%", id=42, state=RUNNABLE, blockCnt=24,
waitCnt=1243]
* at sun.nio.ch.Net.poll(Native Method) at
sun.nio.ch.SocketChannelImpl.poll(SocketChannelImpl.java:954) at
sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:110) at
o.a.i.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2822)*
at
o.a.i.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2587)
at
o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2479)
at
o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2340)
at
o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2304)
at
o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:1293)
at
o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:1362)
at
o.a.i.i.processors.cache.GridCacheIoManager.send(GridCacheIoManager.java:910)
at
o.a.i.i.processors.cache.GridCacheIoManager.send(GridCacheIoManager.java:1059)
at
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicAbstractUpdateFuture.map(GridDhtAtomicAbstractUpdateFuture.java:358)
at
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1970)
at
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1721)
at
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3181)
at
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$1600(GridDhtAtomicCache.java:126)
at
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$6.apply(GridDhtAtomicCache.java:338)
at
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$6.apply(GridDhtAtomicCache.java:333)
at
o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:827)
at
o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:369)
at
o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:293)
at
o.a.i.i.processors.cache.GridCacheIoManager.access$000(GridCacheIoManager.java:95)
at
o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:238)
at
o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1222)
at
o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:850)
at
o.a.i.i.managers.communication.GridIoManager.access$2100(GridIoManager.java:108)
at
o.a.i.i.managers.communication.GridIoManager$7.run(GridIoManager.java:790)
at o.a.i.i.util.StripedExecutor$Stripe.run(StripedExecutor.java:428)
at java.lang.Thread.run(Thread.java:745)
This probably means there are still connectivity problems. Isn't
*172.20.0.1* the Docker gateway or something like that? I think you should
specify localAddress on TcpCommunicationSpi of all your nodes to match
their external interface.
Regards,
--
Ilya Kasnacheev
ср, 3 июл. 2019 г. в 05:25, wiltu <ph...@hotmail.com>:
> Hi Ilya
> Thanks for your response. I think you are right.
> So, I tried to set "-Djava.net.preferIPv4Stack=true" and there isn't a huge
> amount of TIME_WAIT connections, but still has one node dropped from the
> cluster.
>
> ignite log :
> ignite-a228fd5c.log
> <
> http://apache-ignite-users.70518.x6.nabble.com/file/t2492/ignite-a228fd5c.log>
>
>
> Regards,
> Wilson
>
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>
Re: TcpCommunicationSpi failed to establish connection to node,
node will be dropped from cluster
Posted by wiltu <ph...@hotmail.com>.
Hi Ilya
Thanks for your response. I think you are right.
So, I tried to set "-Djava.net.preferIPv4Stack=true" and there isn't a huge
amount of TIME_WAIT connections, but still has one node dropped from the
cluster.
ignite log :
ignite-a228fd5c.log
<http://apache-ignite-users.70518.x6.nabble.com/file/t2492/ignite-a228fd5c.log>
Regards,
Wilson
--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/