Posted to user@ignite.apache.org by "naresh.goty" <gn...@gmail.com> on 2018/06/06 18:24:02 UTC

Re: Ignite Node failure - Node out of topology (SEGMENTED)

Thanks.
We have enabled the IPv4 JVM option in our non-production environment and found
no segmentation issues reported there. Our main concern is that the issue is
happening only in production, and we are very interested in finding the real root
cause (we can rule out GC pauses, CPU spikes, and network latency; none of them is
the cause).

1) Please provide any useful tips for identifying the source of the problem, so
that we can avoid it altogether instead of taking remediation steps (restarting
the JVM) whenever the issue happens.

2) Do let us know whether any timeout configurations should be increased to
mitigate the problem.

Regards
Naresh





--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

RE: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by Stanislav Lukyanov <st...@gmail.com>.
I see messages in the log that are not from Ignite, but which also suggest that there is some network issue:
========
2018-06-09 10:19:59 [a9fc3882] severe  [native] Exception in controller:
receiveExact() ... error reading, 70014, End of file found. Retrying every
10 seconds.
2018-06-09 10:19:59 [abff8882] warning [native] Instrumentation channel
disconnected: server did not reply to ping request
========

In any case, given that there were no messages in the logs for ~35 minutes, it is unlikely that it is an Ignite issue – Ignite would at least print metrics or errors 
if it was running and getting any CPU time.

Stan

From: naresh.goty
Sent: June 9, 2018, 18:16
To: user@ignite.apache.org
Subject: Re: Ignite Node failure - Node out of topology (SEGMENTED)

We are still seeing the NODE SEGMENTATION issue happening to one of the nodes in
our production environment even after the JVM option is enabled
(-Djava.net.preferIPv4Stack=true).

We don't see any activity reported in the logs for a period of ~30 min after the
node failed.
The logs below are from the failed node. It can be observed that the node was up
and running until this timestamp (9:43:25), and after that there are no messages
until 10:19:59 in the catalina logs. Though the node failed at 09:44, the node
segmentation was reported at 10:19 AM.

Right before the node failure, the metrics clearly showed that CPU and memory
utilization were very low. Long GC pauses were also not an issue.
=================================================
Jun 09, 2018 9:42:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=10239,
reusePages=6228]
Jun 09, 2018 9:42:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=278]
Jun 09, 2018 9:43:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=9bbe0362, name=delivery, uptime=56:24:59.907]
    ^-- H/N/C [hosts=9, nodes=9, CPUs=18]
    ^-- CPU [cur=1.67%, avg=2.71%, GC=0%]
    ^-- PageMemory [pages=43003]
    ^-- Heap [used=605MB, free=85.11%, comm=4065MB]
    ^-- Non heap [used=204MB, free=-1%, comm=213MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=6, qSize=0]
    ^-- Outbound messages queue [size=0]
Jun 09, 2018 9:43:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=10239,
reusePages=6228]
Jun 09, 2018 9:43:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=278]
2018-06-09 10:19:59 [a9fc3882] warning [java  ] ... last message repeated 1
time ...
2018-06-09 10:19:59 [a9fc3882] severe  [native] Exception in controller:
receiveExact() ... error reading, 70014, End of file found. Retrying every
10 seconds.
2018-06-09 10:19:59 [abff8882] warning [native] Instrumentation channel
disconnected: server did not reply to ping request
Jun 09, 2018 10:19:59 AM org.apache.ignite.logger.java.JavaLogger error
SEVERE: Failed to send message: class o.a.i.IgniteCheckedException: Failed
to send message (connection was closed): GridSelectorNioSessionImpl
[worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=0,
bytesRcvd=24290335228, bytesSent=51080178562, bytesRcvd0=25800,
bytesSent0=183862, select=true, super=GridWorker
[name=grid-nio-worker-tcp-comm-0, igniteInstanceName=delivery,
finished=false, hashCode=1045175627, interrupted=false,
runner=grid-nio-worker-tcp-comm-0-#25%delivery%]]],
writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
inRecovery=GridNioRecoveryDescriptor [acked=3240064, resendCnt=0,
rcvCnt=3240098, sentCnt=3240092, reserved=false, lastAck=3240096,
nodeLeft=false, node=TcpDiscoveryNode
[id=9165f32c-9765-49d7-8856-5b77b0bded6d, addrs=[10.40.173.14, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.14:47500], discPort=47500, order=22,
intOrder=15, lastExchangeTime=1528334301578, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], connected=false,
connectCnt=1, queueLimit=131072, reserveCnt=1, pairedConnections=false],
outRecovery=GridNioRecoveryDescriptor [acked=3240064, resendCnt=0,
rcvCnt=3240098, sentCnt=3240092, reserved=false, lastAck=3240096,
nodeLeft=false, node=TcpDiscoveryNode
[id=9165f32c-9765-49d7-8856-5b77b0bded6d, addrs=[10.40.173.14, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.14:47500], discPort=47500, order=22,
intOrder=15, lastExchangeTime=1528334301578, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], connected=false,
connectCnt=1, queueLimit=131072, reserveCnt=1, pairedConnections=false],
super=GridNioSessionImpl [locAddr=/10.40.173.21:50320,
rmtAddr=/10.40.173.14:47100, createTime=1528334304232,
closeTime=1528537460337, bytesSent=13308284327, bytesRcvd=12937081481,
bytesSent0=52340, bytesRcvd0=7812, sndSchedTime=1528334304232,
lastSndTime=1528537460337, lastRcvTime=1528537460337, readsPaused=false,
filterChain=FilterChain[filters=[GridNioCodecFilter
[parser=o.a.i.i.util.nio.GridDirectParser@7d2bc944, directMode=true],
GridConnectionBytesVerifyFilter], accepted=false]]
====================================================

One of the active nodes reported that the other node (above) had failed
immediately, at 09:44 (as shown in the logs below):

Jun 09, 2018 9:43:50 AM org.apache.ignite.logger.java.JavaLogger info
INFO:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=b392f3c6, name=delivery, uptime=56:18:06.778]
    ^-- H/N/C [hosts=9, nodes=9, CPUs=18]
    ^-- CPU [cur=9.17%, avg=9%, GC=0%]
    ^-- PageMemory [pages=41038]
    ^-- Heap [used=3167MB, free=22.42%, comm=4082MB]
    ^-- Non heap [used=202MB, free=-1%, comm=210MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=5, qSize=0]
    ^-- Outbound messages queue [size=0]
Jun 09, 2018 9:43:50 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=9096, reusePages=4659]
Jun 09, 2018 9:43:50 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=240]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=9bbe0362-00e0-4f16-abf0-52de5c62c51d, addrs=[10.40.173.21, 127.0.0.1],
sockAddrs=[/10.40.173.21:47500, /127.0.0.1:47500], discPort=47500, order=28,
intOrder=18, lastExchangeTime=1528334738846, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=40, servers=8, clients=0, CPUs=16, heap=26.0GB]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Started exchange init [topVer=AffinityTopologyVersion [topVer=40,
minorTopVer=0], crd=false, evt=NODE_FAILED,
evtNode=9bbe0362-00e0-4f16-abf0-52de5c62c51d, customEvt=null,
allowMerge=true]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished waiting for partition release future
[topVer=AffinityTopologyVersion [topVer=40, minorTopVer=0], waitTime=0ms,
futInfo=NA]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished exchange init [topVer=AffinityTopologyVersion [topVer=40,
minorTopVer=0], crd=false]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Received full message, will finish exchange
[node=9165f32c-9765-49d7-8856-5b77b0bded6d, resVer=AffinityTopologyVersion
[topVer=40, minorTopVer=0]]
Jun 09, 2018 9:44:22 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Finish exchange future [startVer=AffinityTopologyVersion [topVer=40,
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=40, minorTopVer=0],
err=null]
Jun 09, 2018 9:44:22 AM org.apache.ignite.logger.java.JavaLogger info


================================================================

Any advice on this issue would be appreciated.

-Naresh




--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by VeenaMithare <v....@cmcmarkets.com>.
Hi Evgenii,

Thank you for the reply and suggestion.

regards,
Veena.



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by Evgenii Zhuravlev <e....@gmail.com>.
Hi,

A segmentation plugin won't help with the issue itself. If you have a long GC
pause, the node is unresponsive for all that time. If a GC pause is longer than
10 seconds, the node will be dropped from the cluster (by default). If you have
long GC pauses, your load is probably too big for your configuration, and it
makes sense to increase the heap or add more nodes to the cluster.
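
For reference, the 10-second default mentioned above corresponds to
failureDetectionTimeout on IgniteConfiguration. A minimal sketch of raising it
(the 30-second value is only an illustration, not a recommendation):
========
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class NodeWithLongerFailureDetection {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Default is 10_000 ms: a node that is unresponsive for longer than
        // this is dropped from the cluster. Raising it only masks long pauses.
        cfg.setFailureDetectionTimeout(30_000);

        Ignition.start(cfg);
    }
}
========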

Evgenii

Tue, Apr 14, 2020 at 03:50, VeenaMithare <v....@cmcmarkets.com>:

> Hi Dmitry,
>
> Would having a segmentation plugin help to resolve segmentation due to GC
> pauses?
>
> Or is the best resolution for long GC pauses to resolve them and get the GC
> pauses within the failure detection timeout?
>
> regards,
> Veena.
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>

RE: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by VeenaMithare <v....@cmcmarkets.com>.
Hi Dmitry,

Would having a segmentation plugin help to resolve segmentation due to GC
pauses?

Or is the best resolution for long GC pauses to resolve them and get the GC
pauses within the failure detection timeout?

regards,
Veena.




--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

RE: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by dkarachentsev <dk...@gridgain.com>.
Naresh,

GC logs show not only GC pauses but system pauses as well. Try these
parameters:
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime 
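
A sketch of how these flags might be combined on one JVM options line (the
-Xloggc path is just an example, not an existing location in this setup):
========
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/ignite/gc.log
========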

Thanks!
-Dmitry



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

RE: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by "naresh.goty" <gn...@gmail.com>.
Thanks for the recommendation, but we have already identified and addressed the
issues with GC pauses in the JVM, and now we cannot find any long GC activity
during the time of the node failure due to network segmentation (please find
the attached screenshot of GC activity from the Dynatrace agent).

From the screenshot, there is only young-generation GC collection, and that is
< 100 ms.

We can still enable GC logs, but I strongly suspect the issue is beyond JVM
pauses.

Thanks,
Naresh

<http://apache-ignite-users.70518.x6.nabble.com/file/t1286/gc_activity.png> 



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

RE: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by dkarachentsev <dk...@gridgain.com>.
Hi Naresh,

Actually, any JVM process hang could lead to segmentation. If a node is not
responsive for longer than failureDetectionTimeout, it will be kicked out of
the cluster to prevent grid-wide performance degradation.

It works in the following way. Let's say we have 3 nodes in a ring: n1 ->
n2 -> n3. Discovery messages go over the ring, along with metrics and
connection checks, at a predefined interval. Node 2 starts experiencing issues
such as a GC pause or an OS failure that forces the process to stop. During that
time node 1 is unable to send messages to n2 (it doesn't receive an ack). n1
waits for failureDetectionTimeout and establishes a connection to n3 instead:
n1 -> n3, while n2 is no longer connected.

The cluster treats n2 as failed. When n2 comes back, it tries to connect to n3
and send a message across the ring, at which point it receives a message saying
it is out of the grid. For n2 that means it was segmented, and the best thing it
can do is stop.

To check whether there were large JVM or system pauses, you may enable GC logs.
If the pauses are longer than failureDetectionTimeout, the node will be segmented.

The best way would be to fix the pauses, but as a workaround you can increase the
timeout.
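
If the global failureDetectionTimeout is left untouched, the discovery-level
timeouts can be raised explicitly instead; a rough sketch with illustrative
values (note that, as far as I know, setting these explicitly makes discovery
ignore failureDetectionTimeout):
========
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

public class NodeWithExplicitDiscoveryTimeouts {
    public static void main(String[] args) {
        TcpDiscoverySpi discoSpi = new TcpDiscoverySpi();
        // How long to wait for a discovery message acknowledgement
        // before retrying with an increased timeout.
        discoSpi.setAckTimeout(10_000);
        // Socket operation timeout for discovery connections.
        discoSpi.setSocketTimeout(10_000);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(discoSpi);

        Ignition.start(cfg);
    }
}
========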

Thanks!
-Dmitry



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

RE: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by "naresh.goty" <gn...@gmail.com>.
Hi Dmitry,

We are again seeing a segmentation failure on one of the nodes in our prod
environment. This time we did not run jmap, but the node still failed.

-> CPU, memory utilization, and the network are all in an optimal state.

We observed page faults in memory at the same time as the segmentation failure,
as reported by the Dynatrace agent (attached screenshot).

Can you please confirm whether page faults could result in network segmentation
on a node?
We do see page faults on nodes, but they do not always result in a segmentation
failure.
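
As a side note, one way to check whether major page faults line up with the
failure window, assuming the sysstat package is installed (a sketch, adjust the
interval and count as needed):
========
# Report system-wide paging statistics (including majflt/s) every second, 60 samples:
sar -B 1 60
========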


Logs from Failed Agent:
================================
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=0]
Jun 23, 2018 8:40:00 PM org.apache.ignite.logger.java.JavaLogger info
INFO:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=3f568bb8, name=delivery, uptime=24:31:12.859]
    ^-- H/N/C [hosts=9, nodes=9, CPUs=18]
    ^-- CPU [cur=7%, avg=9.06%, GC=0%]
    ^-- PageMemory [pages=30244]
    ^-- Heap [used=3184MB, free=22.09%, comm=4087MB]
    ^-- Non heap [used=213MB, free=-1%, comm=222MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=5, qSize=0]
    ^-- Outbound messages queue [size=0]
Jun 23, 2018 8:40:00 PM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4879, reusePages=0]
Jun 23, 2018 8:40:00 PM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=0]
Jun 23, 2018 8:40:34 PM org.apache.ignite.logger.java.JavaLogger info
INFO: TCP discovery accepted incoming connection [rmtAddr=/10.40.173.14,
rmtPort=33762]
Jun 23, 2018 8:40:34 PM org.apache.ignite.logger.java.JavaLogger info
INFO: TCP discovery spawning a new thread for connection
[rmtAddr=/10.40.173.14, rmtPort=33762]
Jun 23, 2018 8:40:34 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Started serving remote node connection [rmtAddr=/10.40.173.14:33762,
rmtPort=33762]
Jun 23, 2018 8:40:34 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node is out of topology (probably, due to short-time network
problems).
Jun 23, 2018 8:40:34 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Local node SEGMENTED: TcpDiscoveryNode
[id=3f568bb8-813d-47f7-b8da-4ecbff3e9753, addrs=[10.40.173.78, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.78:47500], discPort=47500, order=54,
intOrder=32, lastExchangeTime=1529786434361, loc=true,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished serving remote node connection [rmtAddr=/10.40.173.14:33762,
rmtPort=33762
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished serving remote node connection [rmtAddr=/10.40.173.41:52584,
rmtPort=52584
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Stopping local node according to configured segmentation policy.
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=9165f32c-9765-49d7-8856-5b77b0bded6d, addrs=[10.40.173.14, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.14:47500], discPort=47500, order=22,
intOrder=15, lastExchangeTime=1529050123714,
loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Command protocol successfully stopped: TCP binary
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=56, servers=8, clients=0, CPUs=16, heap=28.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=a26de809-dde1-41b8-87a3-d5576851a0be, addrs=[10.40.173.56, 127.0.0.1],
sockAddrs=[/10.40.173.56:47500, /127.0.0.1:47500], discPort=47500, order=23,
intOrder=16, lastExchangeTime=1529050123735,
loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=57, servers=7, clients=0, CPUs=14, heap=26.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=910ea19f-af5c-4745-a035-b24a3bb48206, addrs=[10.40.173.88, 127.0.0.1],
sockAddrs=[/10.40.173.88:47500, /127.0.0.1:47500], discPort=47500, order=25,
intOrder=17, lastExchangeTime=1529050123735,
loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=58, servers=6, clients=0, CPUs=12, heap=24.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=17f3ba9c-e32e-47e4-9ca2-136338d8c4ac, addrs=[10.40.173.39, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.39:47500], discPort=47500, order=30,
intOrder=19, lastExchangeTime=1529050123735, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=59, servers=5, clients=0, CPUs=10, heap=20.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=b392f3c6-84fd-4cd9-a695-92d1ef3b4262, addrs=[10.40.173.11, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.11:47500], discPort=47500, order=34,
intOrder=21, lastExchangeTime=1529050123735, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=60, servers=4, clients=0, CPUs=8, heap=16.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=e2eb6d96-e60e-4643-ac7a-2b750888079e, addrs=[10.40.173.21, 127.0.0.1],
sockAddrs=[/10.40.173.21:47500, /127.0.0.1:47500], discPort=47500, order=41,
intOrder=25, lastExchangeTime=1529050123735, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=61, servers=3, clients=0, CPUs=6, heap=12.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=8975781d-ac95-49eb-9f17-4be2d3374b15, addrs=[10.40.173.74, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.74:47500], discPort=47500, order=45,
intOrder=27, lastExchangeTime=1529050123735, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=62, servers=2, clients=0, CPUs=4, heap=8.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning


==================================


Logs from Coordinator node (or reported agent)

Jun 23, 2018 8:39:18 PM org.apache.ignite.logger.java.JavaLogger info
INFO:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=96268498, name=delivery, uptime=16:54:44.560]
    ^-- H/N/C [hosts=9, nodes=9, CPUs=18]
    ^-- CPU [cur=6.5%, avg=9.07%, GC=0%]
    ^-- PageMemory [pages=33192]
    ^-- Heap [used=3396MB, free=16.88%, comm=4086MB]
    ^-- Non heap [used=219MB, free=-1%, comm=228MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=5, qSize=0]
    ^-- Outbound messages queue [size=0]
Jun 23, 2018 8:39:18 PM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=5624, reusePages=63]
Jun 23, 2018 8:39:18 PM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=0]
Jun 23, 2018 8:40:18 PM org.apache.ignite.logger.java.JavaLogger info
INFO:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=96268498, name=delivery, uptime=16:55:44.622]
    ^-- H/N/C [hosts=9, nodes=9, CPUs=18]
    ^-- CPU [cur=7.83%, avg=9.07%, GC=0%]
    ^-- PageMemory [pages=33192]
    ^-- Heap [used=3188MB, free=21.98%, comm=4086MB]
    ^-- Non heap [used=219MB, free=-1%, comm=228MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=5, qSize=0]
    ^-- Outbound messages queue [size=0]
Jun 23, 2018 8:40:18 PM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=5624, reusePages=63]
Jun 23, 2018 8:40:18 PM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=0]
Jun 23, 2018 8:40:54 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Timed out waiting for message delivery receipt (most probably, the
reason is in long GC pauses on remote node; consider tuning GC and
increasing 'ackTimeout' configuration property). Will retry to send message
with increased timeout [currentTimeout=9990, rmtAddr=/10.40.173.78:47500,
rmtPort=47500]
Jun 23, 2018 8:40:54 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Failed to send message to next node
[msg=TcpDiscoveryMetricsUpdateMessage [super=TcpDiscoveryAbstractMessage
[sndNodeId=8975781d-ac95-49eb-9f17-4be2d3374b15,
id=bd526e6e361-9165f32c-9765-49d7-8856-5b77b0bded6d,
verifierNodeId=9165f32c-9765-49d7-8856-5b77b0bded6d, topVer=0, pendingIdx=0,
failedNodes=null, isClient=false]], next=TcpDiscoveryNode
[id=3f568bb8-813d-47f7-b8da-4ecbff3e9753, addrs=[10.40.173.78, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.78:47500], discPort=47500, order=54,
intOrder=32, lastExchangeTime=1529050142836, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], errMsg=Failed to send
message to next node [msg=TcpDiscoveryMetricsUpdateMessage
[super=TcpDiscoveryAbstractMessage
[sndNodeId=8975781d-ac95-49eb-9f17-4be2d3374b15,
id=bd526e6e361-9165f32c-9765-49d7-8856-5b77b0bded6d,
verifierNodeId=9165f32c-9765-49d7-8856-5b77b0bded6d, topVer=0, pendingIdx=0,
failedNodes=null, isClient=false]], next=ClusterNode
[id=3f568bb8-813d-47f7-b8da-4ecbff3e9753, order=54, addr=[10.40.173.78,
127.0.0.1], daemon=false]]]
Jun 23, 2018 8:40:54 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Local node has detected failed nodes and started cluster-wide
procedure. To speed up failure detection please see 'Failure Detection'
section under javadoc for 'TcpDiscoverySpi'
Jun 23, 2018 8:40:55 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=3f568bb8-813d-47f7-b8da-4ecbff3e9753, addrs=[10.40.173.78, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.78:47500], discPort=47500, order=54,
intOrder=32, lastExchangeTime=1529050142836, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:55 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=56, servers=8, clients=0, CPUs=16, heap=26.0GB]
Jun 23, 2018 8:40:55 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Started exchange init [topVer=AffinityTopologyVersion [topVer=56,
minorTopVer=0], crd=false, evt=NODE_FAILED,
evtNode=3f568bb8-813d-47f7-b8da-4ecbff3e9753, customEvt=null,
allowMerge=true]
Jun 23, 2018 8:40:55 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished waiting for partition release future
[topVer=AffinityTopologyVersion [topVer=56, minorTopVer=0], waitTime=0ms,
futInfo=NA]
Jun 23, 2018 8:40:55 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished exchange init [topVer=AffinityTopologyVersion [topVer=56,
minorTopVer=0], crd=false]
Jun 23, 2018 8:40:55 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Received full message, will finish exchange
[node=9165f32c-9765-49d7-8856-5b77b0bded6d, resVer=AffinityTopologyVersion
[topVer=56, minorTopVer=0]]
===================================================

<http://apache-ignite-users.70518.x6.nabble.com/file/t1286/pageFaults.png> 

Thanks
Naresh



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

RE: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by "naresh.goty" <gn...@gmail.com>.
Thanks Dmitry



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

RE: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by dkarachentsev <dk...@gridgain.com>.
Hi Naresh,

The recommendation will be the same: increase failureDetectionTimeout until the
nodes stop segmenting, or use gdb (or remove the "live" option from the jmap
command to skip the full GC).
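
For reference, a sketch of the two jmap variants (the file path and PID are
placeholders):
========
# With "live": forces a full GC first, so only live objects are dumped.
jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>

# Without "live": skips the forced full GC before the dump.
jmap -dump:format=b,file=/tmp/heap.hprof <pid>
========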

Thanks!
-Dmitry



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

RE: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by "naresh.goty" <gn...@gmail.com>.
thanks Stan.

We have enabled actionable alerts to generate dumps when memory utilization
reaches a certain threshold on all cache nodes. Whenever the alert is triggered,
the cache node gets segmented. So, essentially, we cannot take dumps on a live
node. Even increasing the socket timeout may not work, as the heap dump process
takes minutes to complete and the node is unresponsive during that time. How can
we get around this issue? (We could not evaluate gdb yet.)
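
If the dump has to be triggered from inside the alert hook rather than via jmap,
one option is the HotSpotDiagnosticMXBean with live=false, so no full GC is
forced; this is only a sketch, and the node will still pause while the dump file
is written:
========
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class HeapDumper {
    public static void dump(String path) throws Exception {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
            ManagementFactory.getPlatformMBeanServer(),
            "com.sun.management:type=HotSpotDiagnostic",
            HotSpotDiagnosticMXBean.class);

        // live = false: do not force a full GC before dumping.
        bean.dumpHeap(path, false);
    }

    public static void main(String[] args) throws Exception {
        dump("/tmp/heap.hprof"); // the path is just an example
    }
}
========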

Regards,
Naresh





--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

RE: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by Stanislav Lukyanov <st...@gmail.com>.
Quick googling suggests using gdb:
https://stackoverflow.com/a/37375351/4153863

Again, Ignite doesn’t play any role here – whatever works for any Java application should work for Ignite as well.
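
A sketch of the gdb-based approach from that link (paths and PID are
placeholders; the second step uses the JDK 8 jmap syntax for reading a core file):
========
# Write a core dump of the running JVM; the process is paused only while the
# core is written, with no forced full GC.
gcore -o /tmp/java-core <pid>

# Convert the core into a heap dump offline, away from the live node.
jmap -dump:format=b,file=/tmp/heap.hprof /usr/bin/java /tmp/java-core.<pid>
========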

Stan

From: naresh.goty
Sent: June 11, 2018, 20:56
To: user@ignite.apache.org
Subject: Re: Ignite Node failure - Node out of topology (SEGMENTED)

Hi All,

We found that when the jmap utility is triggered to generate heap dumps on the
application node, a NODE_SEGMENTATION event is fired from the node. Can someone
please let us know how to safely take heap dumps on a live node, with the Ignite
cache running in embedded mode, without crashing the node due to a segmentation
failure?

Regards,
Naresh



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by "naresh.goty" <gn...@gmail.com>.
Hi All,

We found that when the jmap utility is triggered to generate heap dumps on the
application node, a NODE_SEGMENTATION event is fired from the node. Can someone
please let us know how to safely take heap dumps on a live node, with the Ignite
cache running in embedded mode, without crashing the node due to a segmentation
failure?

Regards,
Naresh



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by "naresh.goty" <gn...@gmail.com>.
We are still seeing the NODE SEGMENTATION issue happening to one of the nodes in
our production environment even after the JVM option is enabled
(-Djava.net.preferIPv4Stack=true).

We don't see any activity reported in the logs for a period of ~30 min after the
node failed.
The logs below are from the failed node. It can be observed that the node was up
and running until this timestamp (9:43:25), and after that there are no messages
until 10:19:59 in the catalina logs. Though the node failed at 09:44, the node
segmentation was reported at 10:19 AM.

Right before the node failure, the metrics clearly showed that CPU and memory
utilization were very low. Long GC pauses were also not an issue.
=================================================
Jun 09, 2018 9:42:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=10239,
reusePages=6228]
Jun 09, 2018 9:42:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=278]
Jun 09, 2018 9:43:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=9bbe0362, name=delivery, uptime=56:24:59.907]
    ^-- H/N/C [hosts=9, nodes=9, CPUs=18]
    ^-- CPU [cur=1.67%, avg=2.71%, GC=0%]
    ^-- PageMemory [pages=43003]
    ^-- Heap [used=605MB, free=85.11%, comm=4065MB]
    ^-- Non heap [used=204MB, free=-1%, comm=213MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=6, qSize=0]
    ^-- Outbound messages queue [size=0]
Jun 09, 2018 9:43:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=10239,
reusePages=6228]
Jun 09, 2018 9:43:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=278]
2018-06-09 10:19:59 [a9fc3882] warning [java  ] ... last message repeated 1
time ...
2018-06-09 10:19:59 [a9fc3882] severe  [native] Exception in controller:
receiveExact() ... error reading, 70014, End of file found. Retrying every
10 seconds.
2018-06-09 10:19:59 [abff8882] warning [native] Instrumentation channel
disconnected: server did not reply to ping request
Jun 09, 2018 10:19:59 AM org.apache.ignite.logger.java.JavaLogger error
SEVERE: Failed to send message: class o.a.i.IgniteCheckedException: Failed
to send message (connection was closed): GridSelectorNioSessionImpl
[worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=0,
bytesRcvd=24290335228, bytesSent=51080178562, bytesRcvd0=25800,
bytesSent0=183862, select=true, super=GridWorker
[name=grid-nio-worker-tcp-comm-0, igniteInstanceName=delivery,
finished=false, hashCode=1045175627, interrupted=false,
runner=grid-nio-worker-tcp-comm-0-#25%delivery%]]],
writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
inRecovery=GridNioRecoveryDescriptor [acked=3240064, resendCnt=0,
rcvCnt=3240098, sentCnt=3240092, reserved=false, lastAck=3240096,
nodeLeft=false, node=TcpDiscoveryNode
[id=9165f32c-9765-49d7-8856-5b77b0bded6d, addrs=[10.40.173.14, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.14:47500], discPort=47500, order=22,
intOrder=15, lastExchangeTime=1528334301578, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], connected=false,
connectCnt=1, queueLimit=131072, reserveCnt=1, pairedConnections=false],
outRecovery=GridNioRecoveryDescriptor [acked=3240064, resendCnt=0,
rcvCnt=3240098, sentCnt=3240092, reserved=false, lastAck=3240096,
nodeLeft=false, node=TcpDiscoveryNode
[id=9165f32c-9765-49d7-8856-5b77b0bded6d, addrs=[10.40.173.14, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.14:47500], discPort=47500, order=22,
intOrder=15, lastExchangeTime=1528334301578, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], connected=false,
connectCnt=1, queueLimit=131072, reserveCnt=1, pairedConnections=false],
super=GridNioSessionImpl [locAddr=/10.40.173.21:50320,
rmtAddr=/10.40.173.14:47100, createTime=1528334304232,
closeTime=1528537460337, bytesSent=13308284327, bytesRcvd=12937081481,
bytesSent0=52340, bytesRcvd0=7812, sndSchedTime=1528334304232,
lastSndTime=1528537460337, lastRcvTime=1528537460337, readsPaused=false,
filterChain=FilterChain[filters=[GridNioCodecFilter
[parser=o.a.i.i.util.nio.GridDirectParser@7d2bc944, directMode=true],
GridConnectionBytesVerifyFilter], accepted=false]]
====================================================

One of the active nodes reported that the other node (above) had failed
immediately, at 09:44 (as shown in the logs below):

Jun 09, 2018 9:43:50 AM org.apache.ignite.logger.java.JavaLogger info
INFO:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=b392f3c6, name=delivery, uptime=56:18:06.778]
    ^-- H/N/C [hosts=9, nodes=9, CPUs=18]
    ^-- CPU [cur=9.17%, avg=9%, GC=0%]
    ^-- PageMemory [pages=41038]
    ^-- Heap [used=3167MB, free=22.42%, comm=4082MB]
    ^-- Non heap [used=202MB, free=-1%, comm=210MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=5, qSize=0]
    ^-- Outbound messages queue [size=0]
Jun 09, 2018 9:43:50 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=9096, reusePages=4659]
Jun 09, 2018 9:43:50 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=240]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=9bbe0362-00e0-4f16-abf0-52de5c62c51d, addrs=[10.40.173.21, 127.0.0.1],
sockAddrs=[/10.40.173.21:47500, /127.0.0.1:47500], discPort=47500, order=28,
intOrder=18, lastExchangeTime=1528334738846, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=40, servers=8, clients=0, CPUs=16, heap=26.0GB]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Started exchange init [topVer=AffinityTopologyVersion [topVer=40,
minorTopVer=0], crd=false, evt=NODE_FAILED,
evtNode=9bbe0362-00e0-4f16-abf0-52de5c62c51d, customEvt=null,
allowMerge=true]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished waiting for partition release future
[topVer=AffinityTopologyVersion [topVer=40, minorTopVer=0], waitTime=0ms,
futInfo=NA]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished exchange init [topVer=AffinityTopologyVersion [topVer=40,
minorTopVer=0], crd=false]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Received full message, will finish exchange
[node=9165f32c-9765-49d7-8856-5b77b0bded6d, resVer=AffinityTopologyVersion
[topVer=40, minorTopVer=0]]
Jun 09, 2018 9:44:22 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Finish exchange future [startVer=AffinityTopologyVersion [topVer=40,
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=40, minorTopVer=0],
err=null]
Jun 09, 2018 9:44:22 AM org.apache.ignite.logger.java.JavaLogger info


================================================================

Any advice on this issue would be appreciated.

-Naresh




--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by Andrey Mashenkov <an...@gmail.com>.
Hi,

It seems there is a bug related to IPv6 usage [1]; this has to be investigated.
There is also a discussion [2].

[1] https://issues.apache.org/jira/browse/IGNITE-6503
[2]
http://apache-ignite-developers.2346864.n4.nabble.com/Issues-if-Djava-net-preferIPv4Stack-true-is-not-set-td22372.html
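
For completeness, since the catalina logs mentioned earlier suggest the nodes run
under Tomcat, the flag can be made permanent in Tomcat's setenv script; a sketch
under that assumption:
========
# $CATALINA_BASE/bin/setenv.sh
export CATALINA_OPTS="$CATALINA_OPTS -Djava.net.preferIPv4Stack=true"
========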

On Wed, Jun 6, 2018 at 9:24 PM, naresh.goty <gn...@gmail.com> wrote:

> Thanks.
> We have enabled the IPv4 JVM option in our non-production environment and found
> no segmentation issues reported there. Our main concern is that the issue is
> happening only in production, and we are very interested in finding the real
> root cause (we can rule out GC pauses, CPU spikes, and network latency; none of
> them is the cause).
>
> 1) Please provide any useful tips for identifying the source of the problem, so
> that we can avoid it altogether instead of taking remediation steps (restarting
> the JVM) whenever the issue happens.
>
> 2) Do let us know whether any timeout configurations should be increased to
> mitigate the problem.
>
> Regards
> Naresh
>
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>



-- 
Best regards,
Andrey V. Mashenkov