You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ignite.apache.org by "Alexey Kukushkin (Jira)" <ji...@apache.org> on 2021/07/23 11:34:00 UTC

[jira] [Created] (IGNITE-15184) Ignite server cannot reconnect client: Close incoming connection, unknown node

Alexey Kukushkin created IGNITE-15184:
-----------------------------------------

             Summary: Ignite server cannot reconnect client: Close incoming connection, unknown node
                 Key: IGNITE-15184
                 URL: https://issues.apache.org/jira/browse/IGNITE-15184
             Project: Ignite
          Issue Type: Bug
          Components: general
    Affects Versions: 2.9.1, 2.7
         Environment: Reproduced on AI 2.7 and 2.9.1

Client OS: Windows Server 2012 R2 6.3 amd64

Server OS: Linux 3.10.0-1062.9.1.el7.x86_64 amd64

See the Ignite configuration in the client and server logs in the Description
            Reporter: Alexey Kukushkin


Ignite client nodes eventually lose connectivity with the cluster and never reconnect. The below logs show the server rejects the client reconnection with a warning "{color:#de350b}Close incoming connection, unknown node{color}".

It is possible the problem was caused by a severely degraded network throughput. However, the server should keep the client session for "clientFailureDetectionTimeout" (10 seconds by default) and the client should keep trying to re-connect automatically during that time. Instead, we see the server immediately says "Close incoming connection, unknown node" and the client never re-connects.

Ignite should survive temporary network problems (with duration less than clientFailureDetectionTimeout).

 

See the server and client errors and configurations in the logs below.
h3. Server

{{[05:33:00,946][INFO][main][IgniteKernal%StaticGrid] IgniteConfiguration [igniteInstanceName=StaticGrid, pubPoolSize=16, svcPoolSize=16, callbackPoolSize=16, stripedPoolSize=16, sysPoolSize=16, mgmtPoolSize=4, dataStreamerPoolSize=16, utilityCachePoolSize=16, utilityCacheKeepAliveTime=60000, p2pPoolSize=2, qryPoolSize=16, buildIdxPoolSize=4, igniteHome=/ignite/apache-ignite-2.9.1-bin, igniteWorkDir=/ignite/apache-ignite-2.9.1-bin/work, mbeanSrv=com.sun.jmx.mbeanserver.JmxMBeanServer@6f94fa3e, nodeId=e5845e09-df08-4b85-b633-f14dec8f8106, marsh=BinaryMarshaller [], marshLocJobs=false, daemon=false, p2pEnabled=false, netTimeout=5000, netCompressionLevel=1, sndRetryDelay=1000, sndRetryCnt=3, metricsHistSize=10000, metricsUpdateFreq=2000, metricsExpTime=9223372036854775807, discoSpi=TcpDiscoverySpi [addrRslvr=null, sockTimeout=0, ackTimeout=0, marsh=null, reconCnt=10, reconDelay=2000, maxAckTimeout=600000, soLinger=0, forceSrvMode=false, clientReconnectDisabled=false, internalLsnr=null, skipAddrsRandomization=false], segPlc=RESTART_JVM, segResolveAttempts=10, waitForSegOnStart=true, allResolversPassReq=true, segChkFreq=20000, commSpi=TcpCommunicationSpi [connectGate=null, connPlc=org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$FirstConnectionPolicy@62ddbd7e, chConnPlc=null, enableForcibleNodeKill=false, enableTroubleshootingLog=false, locAddr=null, locHost=null, locPort=51605, locPortRange=100, shmemPort=-1, directBuf=true, directSndBuf=false, idleConnTimeout=600000, connTimeout=5000, maxConnTimeout=600000, reconCnt=10, sockSndBuf=32768, sockRcvBuf=32768, msgQueueLimit=0, slowClientQueueLimit=0, nioSrvr=null, shmemSrv=null, usePairedConnections=false, connectionsPerNode=1, tcpNoDelay=true, filterReachableAddresses=false, ackSndThreshold=32, unackedMsgsBufSize=0, sockWriteTimeout=2000, boundTcpPort=-1, boundTcpShmemPort=-1, selectorsCnt=8, selectorSpins=0, forceClientToSrvConnections=false, addrRslvr=null, ctxInitLatch=java.util.concurrent.CountDownLatch@74e52ef6[Count = 1], stopping=false, metricsLsnr=null, connectionRequestor=null], evtSpi=org.apache.ignite.spi.eventstorage.NoopEventStorageSpi@6ed3ccb2, colSpi=NoopCollisionSpi [], deploySpi=LocalDeploymentSpi [], indexingSpi=org.apache.ignite.spi.indexing.noop.NoopIndexingSpi@3c9754d8, addrRslvr=null, encryptionSpi=org.apache.ignite.spi.encryption.noop.NoopEncryptionSpi@3bf7ca37, tracingSpi=org.apache.ignite.spi.tracing.NoopTracingSpi@79079097, clientMode=false, rebalanceThreadPoolSize=4, rebalanceTimeout=10000, rebalanceBatchesPrefetchCnt=3, rebalanceThrottle=0, rebalanceBatchSize=524288, txCfg=TransactionConfiguration [txSerEnabled=false, dfltIsolation=REPEATABLE_READ, dfltConcurrency=PESSIMISTIC, dfltTxTimeout=0, txTimeoutOnPartitionMapExchange=0, deadlockTimeout=10000, pessimisticTxLogSize=0, pessimisticTxLogLinger=10000, tmLookupClsName=null, txManagerFactory=null, useJtaSync=false], cacheSanityCheckEnabled=true, discoStartupDelay=60000, deployMode=SHARED, p2pMissedCacheSize=100, locHost=10.212.120.66, timeSrvPortBase=31100, timeSrvPortRange=100, failureDetectionTimeout=120000, sysWorkerBlockedTimeout=null, clientFailureDetectionTimeout=120000, metricsLogFreq=60000, connectorCfg=ConnectorConfiguration [jettyPath=null, host=null, port=11211, noDelay=true, directBuf=false, sndBufSize=32768, rcvBufSize=32768, idleQryCurTimeout=600000, idleQryCurCheckFreq=60000, sndQueueLimit=0, selectorCnt=4, idleTimeout=7000, sslEnabled=false, sslClientAuth=false, sslCtxFactory=null, sslFactory=null, portRange=100, threadPoolSize=16, msgInterceptor=null], odbcCfg=null, warmupClos=null, atomicCfg=AtomicConfiguration [seqReserveSize=1000, cacheMode=PARTITIONED, backups=1, aff=null, grpName=null], classLdr=null, sslCtxFactory=null, platformCfg=null, binaryCfg=null, memCfg=null, pstCfg=null, dsCfg=DataStorageConfiguration [sysRegionInitSize=41943040, sysRegionMaxSize=104857600, pageSize=0, concLvl=0, dfltDataRegConf=DataRegionConfiguration [name=Static_Data_Region, maxSize=34359738368, initSize=1073741824, swapPath=null, pageEvictionMode=DISABLED, evictionThreshold=0.9, emptyPagesPoolSize=100, metricsEnabled=false, metricsSubIntervalCount=5, metricsRateTimeInterval=60000, persistenceEnabled=false, checkpointPageBufSize=0, lazyMemoryAllocation=true], dataRegions=null, storagePath=null, checkpointFreq=180000, lockWaitTime=10000, checkpointThreads=4, checkpointWriteOrder=SEQUENTIAL, walHistSize=20, maxWalArchiveSize=1073741824, walSegments=10, walSegmentSize=67108864, walPath=db/wal, walArchivePath=db/wal/archive, metricsEnabled=false, walMode=LOG_ONLY, walTlbSize=131072, walBuffSize=0, walFlushFreq=2000, walFsyncDelay=1000, walRecordIterBuffSize=67108864, alwaysWriteFullPages=false, fileIOFactory=org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIOFactory@2925bf5b, metricsSubIntervalCnt=5, metricsRateTimeInterval=60000, walAutoArchiveAfterInactivity=-1, writeThrottlingEnabled=false, walCompactionEnabled=false, walCompactionLevel=1, checkpointReadLockTimeout=null, walPageCompression=DISABLED, walPageCompressionLevel=null], snapshotPath=snapshots, activeOnStart=true, activeOnStartPropSetFlag=false, autoActivation=true, autoActivationPropSetFlag=false, clusterStateOnStart=null, sqlConnCfg=null, cliConnCfg=ClientConnectorConfiguration [host=null, port=10800, portRange=100, sockSndBufSize=0, sockRcvBufSize=0, tcpNoDelay=true, maxOpenCursorsPerConn=128, threadPoolSize=16, idleTimeout=0, handshakeTimeout=10000, jdbcEnabled=true, odbcEnabled=true, thinCliEnabled=true, sslEnabled=false, useIgniteSslCtxFactory=true, sslClientAuth=false, sslCtxFactory=null, thinCliCfg=ThinClientConfiguration [maxActiveTxPerConn=100, maxActiveComputeTasksPerConn=0]], mvccVacuumThreadCnt=2, mvccVacuumFreq=5000, authEnabled=false, failureHnd=null, commFailureRslvr=null, sqlCfg=SqlConfiguration [longQryWarnTimeout=4000, dfltQryTimeout=0, sqlQryHistSize=1000, validationEnabled=false]]}}
{{...}}
{{[15:06:29,740][WARNING][tcp-disco-msg-worker-[8817efc3 10.212.120.67:51500 crd]-#2%StaticGrid%-#62%StaticGrid%][TcpDiscoverySpi] Failing client node due to not receiving metrics updates from client node within 'IgniteConfiguration.clientFailureDetectionTimeout' (consider increasing configuration property) [timeout=120000, node=TcpDiscoveryNode [id=841c8962-775e-4109-bc74-2c003598a02d, consistentId=A9428E28-58B8-4860-B038-9F2B392D5FC4, addrs=ArrayList [10.212.120.189], sockAddrs=HashSet [AX07079/10.212.120.189:0], discPort=0, order=276, intOrder=140, lastExchangeTime=1613261110451, loc=false, ver=2.9.1#20201203-sha1:adcce517, isClient=true]]}}
{{[15:06:29,743][WARNING][disco-event-worker-#64%StaticGrid%][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=841c8962-775e-4109-bc74-2c003598a02d, consistentId=A9428E28-58B8-4860-B038-9F2B392D5FC4, addrs=ArrayList [10.212.120.189], sockAddrs=HashSet [AX07079/10.212.120.189:0], discPort=0, order=276, intOrder=140, lastExchangeTime=1613261110451, loc=false, ver=2.9.1#20201203-sha1:adcce517, isClient=true]}}
{{[15:06:29,744][INFO][disco-event-worker-#64%StaticGrid%][GridDiscoveryManager] Topology snapshot [ver=290, locNode=e5845e09, servers=2, clients=14, state=ACTIVE, CPUs=188, offheap=64.0GB, heap=97.0GB]}}
{{[15:06:29,744][INFO][disco-event-worker-#64%StaticGrid%][GridDiscoveryManager] ^-- Baseline [id=0, size=2, online=2, offline=0]}}
{{[15:06:29,747][INFO][grid-nio-worker-tcp-comm-7-#46%StaticGrid%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/10.212.120.66:51605, rmtAddr=/10.212.120.189:62773]}}
{{[15:06:29,747][INFO][exchange-worker-#66%StaticGrid%][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=290, minorTopVer=0], crd=true, evt=NODE_FAILED, evtNode=841c8962-775e-4109-bc74-2c003598a02d, customEvt=null, allowMerge=true, exchangeFreeSwitch=false]}}
{{{color:#de350b}[15:06:29,748][WARNING][grid-nio-worker-tcp-comm-7-#46%StaticGrid%][TcpCommunicationSpi] Close incoming connection, unknown node [nodeId=841c8962-775e-4109-bc74-2c003598a02d, ses=GridSelectorNioSessionImpl [worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=7, bytesRcvd=129090683, bytesSent=3949813526, bytesRcvd0=42, bytesSent0=18, select=true, super=GridWorker [name=grid-nio-worker-tcp-comm-7, igniteInstanceName=StaticGrid, finished=false, heartbeatTs=1613315189741, hashCode=1034460111, interrupted=false, runner=grid-nio-worker-tcp-comm-7-#46%StaticGrid%]]], writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], readBuf=java.nio.DirectByteBuffer[pos=38 lim=38 cap=32768], inRecovery=null, outRecovery=null, closeSocket=true, outboundMessagesQueueSizeMetric=o.a.i.i.processors.metric.impl.LongAdderMetric@69a257d1, super=GridNioSessionImpl [locAddr=/10.212.120.66:51605, rmtAddr=/10.212.120.189:62773, createTime=1613315189741, closeTime=0, bytesSent=18, bytesRcvd=42, bytesSent0=18, bytesRcvd0=42, sndSchedTime=1613315189741, lastSndTime=1613315189741, lastRcvTime=1613315189741, readsPaused=false, filterChain=FilterChain[filters=[GridNioCodecFilter [parser=o.a.i.i.util.nio.GridDirectParser@15582c29, directMode=true], GridConnectionBytesVerifyFilter], accepted=true, markedForClose=false]]]{color}}}
h3. Client

{{[00:42:07,828][INFO][main][IgniteKernal%ignite-instance-5a07dae2-e5c5-4bcc-b865-99b2babb4d89] IgniteConfiguration [igniteInstanceName=ignite-instance-5a07dae2-e5c5-4bcc-b865-99b2babb4d89, pubPoolSize=12, svcPoolSize=12, callbackPoolSize=12, stripedPoolSize=12, sysPoolSize=12, mgmtPoolSize=4, dataStreamerPoolSize=12, utilityCachePoolSize=12, utilityCacheKeepAliveTime=60000, p2pPoolSize=2, qryPoolSize=12, buildIdxPoolSize=3, igniteHome=__XXXX__\apache-ignite-2.9.1-bin, igniteWorkDir=__XXXX__\work, mbeanSrv=com.sun.jmx.mbeanserver.JmxMBeanServer@72d1ad2e, nodeId=04624fa9-4390-408c-8ca7-df0f11466f86, marsh=BinaryMarshaller [], marshLocJobs=false, daemon=false, p2pEnabled=false, netTimeout=5000, netCompressionLevel=1, sndRetryDelay=1000, sndRetryCnt=3, metricsHistSize=10000, metricsUpdateFreq=2000, metricsExpTime=9223372036854775807, discoSpi=TcpDiscoverySpi [addrRslvr=null, sockTimeout=0, ackTimeout=0, marsh=null, reconCnt=10, reconDelay=2000, maxAckTimeout=600000, soLinger=0, forceSrvMode=false, clientReconnectDisabled=false, internalLsnr=null, skipAddrsRandomization=false], segPlc=RESTART_JVM, segResolveAttempts=10, waitForSegOnStart=true, allResolversPassReq=true, segChkFreq=20000, commSpi=TcpCommunicationSpi [connectGate=null, connPlc=org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$FirstConnectionPolicy@53aad5d5, chConnPlc=null, enableForcibleNodeKill=false, enableTroubleshootingLog=false, locAddr=null, locHost=null, locPort=51605, locPortRange=100, shmemPort=-1, directBuf=true, directSndBuf=false, idleConnTimeout=600000, connTimeout=5000, maxConnTimeout=600000, reconCnt=10, sockSndBuf=32768, sockRcvBuf=32768, msgQueueLimit=0, slowClientQueueLimit=0, nioSrvr=null, shmemSrv=null, usePairedConnections=false, connectionsPerNode=1, tcpNoDelay=true, filterReachableAddresses=false, ackSndThreshold=32, unackedMsgsBufSize=0, sockWriteTimeout=2000, boundTcpPort=-1, boundTcpShmemPort=-1, selectorsCnt=6, selectorSpins=0, forceClientToSrvConnections=false, addrRslvr=null, ctxInitLatch=java.util.concurrent.CountDownLatch@46fa7c39[Count = 1], stopping=false, metricsLsnr=null, connectionRequestor=null], evtSpi=org.apache.ignite.spi.eventstorage.NoopEventStorageSpi@1fb700ee, colSpi=NoopCollisionSpi [], deploySpi=LocalDeploymentSpi [], indexingSpi=org.apache.ignite.spi.indexing.noop.NoopIndexingSpi@68e965f5, addrRslvr=null, encryptionSpi=org.apache.ignite.spi.encryption.noop.NoopEncryptionSpi@6f27a732, tracingSpi=org.apache.ignite.spi.tracing.NoopTracingSpi@6c779568, clientMode=true, rebalanceThreadPoolSize=4, rebalanceTimeout=10000, rebalanceBatchesPrefetchCnt=3, rebalanceThrottle=0, rebalanceBatchSize=524288, txCfg=TransactionConfiguration [txSerEnabled=false, dfltIsolation=REPEATABLE_READ, dfltConcurrency=PESSIMISTIC, dfltTxTimeout=0, txTimeoutOnPartitionMapExchange=0, deadlockTimeout=10000, pessimisticTxLogSize=0, pessimisticTxLogLinger=10000, tmLookupClsName=null, txManagerFactory=null, useJtaSync=false], cacheSanityCheckEnabled=true, discoStartupDelay=60000, deployMode=SHARED, p2pMissedCacheSize=100, locHost=10.212.120.180, timeSrvPortBase=31100, timeSrvPortRange=100, failureDetectionTimeout=120000, sysWorkerBlockedTimeout=null, clientFailureDetectionTimeout=120000, metricsLogFreq=60000, connectorCfg=ConnectorConfiguration [jettyPath=null, host=null, port=11211, noDelay=true, directBuf=false, sndBufSize=32768, rcvBufSize=32768, idleQryCurTimeout=600000, idleQryCurCheckFreq=60000, sndQueueLimit=0, selectorCnt=4, idleTimeout=7000, sslEnabled=false, sslClientAuth=false, sslCtxFactory=null, sslFactory=null, portRange=100, threadPoolSize=12, msgInterceptor=null], odbcCfg=null, warmupClos=null, atomicCfg=AtomicConfiguration [seqReserveSize=1000, cacheMode=PARTITIONED, backups=1, aff=null, grpName=null], classLdr=null, sslCtxFactory=null, platformCfg=PlatformDotNetConfiguration [binaryCfg=null], binaryCfg=null, memCfg=null, pstCfg=null, dsCfg=null, snapshotPath=snapshots, activeOnStart=true, activeOnStartPropSetFlag=false, autoActivation=true, autoActivationPropSetFlag=false, clusterStateOnStart=null, sqlConnCfg=null, cliConnCfg=ClientConnectorConfiguration [host=null, port=10800, portRange=100, sockSndBufSize=0, sockRcvBufSize=0, tcpNoDelay=true, maxOpenCursorsPerConn=128, threadPoolSize=12, idleTimeout=0, handshakeTimeout=10000, jdbcEnabled=true, odbcEnabled=true, thinCliEnabled=true, sslEnabled=false, useIgniteSslCtxFactory=true, sslClientAuth=false, sslCtxFactory=null, thinCliCfg=ThinClientConfiguration [maxActiveTxPerConn=100, maxActiveComputeTasksPerConn=0]], mvccVacuumThreadCnt=2, mvccVacuumFreq=5000, authEnabled=false, failureHnd=null, commFailureRslvr=null, sqlCfg=SqlConfiguration [longQryWarnTimeout=4000, dfltQryTimeout=0, sqlQryHistSize=1000, validationEnabled=false]]}}
{{...}}
{{{color:#de350b}[15:06:29,839][WARNING][disco-event-worker-#51%ignite-instance-5a07dae2-e5c5-4bcc-b865-99b2babb4d89%][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=841c8962-775e-4109-bc74-2c003598a02d, consistentId=A9428E28-58B8-4860-B038-9F2B392D5FC4, addrs=ArrayList [10.212.120.189], sockAddrs=HashSet [AX07079/10.212.120.189:0], discPort=0, order=276, intOrder=140, lastExchangeTime=1613263329640, loc=false, ver=2.9.1#20201203-sha1:adcce517, isClient=true]{color}}}
{{[15:06:29,839][INFO][disco-event-worker-#51%ignite-instance-5a07dae2-e5c5-4bcc-b865-99b2babb4d89%][GridDiscoveryManager] Topology snapshot [ver=290, locNode=04624fa9, servers=2, clients=14, state=ACTIVE, CPUs=188, offheap=64.0GB, heap=97.0GB]}}
{{[15:06:29,839][INFO][disco-event-worker-#51%ignite-instance-5a07dae2-e5c5-4bcc-b865-99b2babb4d89%][GridDiscoveryManager] ^-- Baseline [id=0, size=2, online=2, offline=0]}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)