Posted to user@ignite.apache.org by "naresh.goty" <gn...@gmail.com> on 2018/06/06 18:24:02 UTC

Re: Ignite Node failure - Node out of topology (SEGMENTED)

Thanks.
We have enabled the IPv4 JVM option in our non-production environment and found
no segmentation issues reported there. Our main concern is that the issue is
happening only in production, and we are very interested in finding the real root
cause (we can rule out GC pauses, CPU spikes, and network latency; none of them is
the cause).

1) Please provide any useful tips for identifying the source of the problem, so
that we can avoid it altogether instead of taking remediation steps (restarting
the JVM) whenever the issue happens.

2) Do let us know whether any timeout configurations should be increased to
mitigate the problem.

Regards
Naresh





--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

RE: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by Stanislav Lukyanov <st...@gmail.com>.
I see messages in the log that are not from Ignite, but which also suggest that there is some network issue:
========
2018-06-09 10:19:59 [a9fc3882] severe  [native] Exception in controller:
receiveExact() ... error reading, 70014, End of file found. Retrying every
10 seconds.
2018-06-09 10:19:59 [abff8882] warning [native] Instrumentation channel
disconnected: server did not reply to ping request
========

In any case, given that there were no messages in the logs for ~35 minutes, it is unlikely that it is an Ignite issue – Ignite would at least print metrics or errors 
if it was running and getting any CPU time.

Stan

From: naresh.goty
Sent: June 9, 2018, 18:16
To: user@ignite.apache.org
Subject: Re: Ignite Node failure - Node out of topology (SEGMENTED)

We are still seeing the NODE SEGMENTATION issue happening to one of the nodes in
our production environment even after the JVM option is enabled
(-Djava.net.preferIPv4Stack=true).

We don't see any activity reported in the logs for a period of ~30 min after the
node failed.
The logs below are from the failed node. It can be observed that the node was up
and running until this timestamp (9:43:25), and after that there are no messages
until 10:19:59 in the catalina logs. Though the node failed at 09:44, the node
segmentation was reported at 10:19 AM.

Right before the node failure, the metrics clearly showed that CPU and memory
utilization were very low. Long GC pauses were also not an issue.
=================================================
Jun 09, 2018 9:42:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=10239,
reusePages=6228]
Jun 09, 2018 9:42:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=278]
Jun 09, 2018 9:43:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=9bbe0362, name=delivery, uptime=56:24:59.907]
    ^-- H/N/C [hosts=9, nodes=9, CPUs=18]
    ^-- CPU [cur=1.67%, avg=2.71%, GC=0%]
    ^-- PageMemory [pages=43003]
    ^-- Heap [used=605MB, free=85.11%, comm=4065MB]
    ^-- Non heap [used=204MB, free=-1%, comm=213MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=6, qSize=0]
    ^-- Outbound messages queue [size=0]
Jun 09, 2018 9:43:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=10239,
reusePages=6228]
Jun 09, 2018 9:43:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=278]
2018-06-09 10:19:59 [a9fc3882] warning [java  ] ... last message repeated 1
time ...
2018-06-09 10:19:59 [a9fc3882] severe  [native] Exception in controller:
receiveExact() ... error reading, 70014, End of file found. Retrying every
10 seconds.
2018-06-09 10:19:59 [abff8882] warning [native] Instrumentation channel
disconnected: server did not reply to ping request
Jun 09, 2018 10:19:59 AM org.apache.ignite.logger.java.JavaLogger error
SEVERE: Failed to send message: class o.a.i.IgniteCheckedException: Failed
to send message (connection was closed): GridSelectorNioSessionImpl
[worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=0,
bytesRcvd=24290335228, bytesSent=51080178562, bytesRcvd0=25800,
bytesSent0=183862, select=true, super=GridWorker
[name=grid-nio-worker-tcp-comm-0, igniteInstanceName=delivery,
finished=false, hashCode=1045175627, interrupted=false,
runner=grid-nio-worker-tcp-comm-0-#25%delivery%]]],
writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
inRecovery=GridNioRecoveryDescriptor [acked=3240064, resendCnt=0,
rcvCnt=3240098, sentCnt=3240092, reserved=false, lastAck=3240096,
nodeLeft=false, node=TcpDiscoveryNode
[id=9165f32c-9765-49d7-8856-5b77b0bded6d, addrs=[10.40.173.14, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.14:47500], discPort=47500, order=22,
intOrder=15, lastExchangeTime=1528334301578, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], connected=false,
connectCnt=1, queueLimit=131072, reserveCnt=1, pairedConnections=false],
outRecovery=GridNioRecoveryDescriptor [acked=3240064, resendCnt=0,
rcvCnt=3240098, sentCnt=3240092, reserved=false, lastAck=3240096,
nodeLeft=false, node=TcpDiscoveryNode
[id=9165f32c-9765-49d7-8856-5b77b0bded6d, addrs=[10.40.173.14, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.14:47500], discPort=47500, order=22,
intOrder=15, lastExchangeTime=1528334301578, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], connected=false,
connectCnt=1, queueLimit=131072, reserveCnt=1, pairedConnections=false],
super=GridNioSessionImpl [locAddr=/10.40.173.21:50320,
rmtAddr=/10.40.173.14:47100, createTime=1528334304232,
closeTime=1528537460337, bytesSent=13308284327, bytesRcvd=12937081481,
bytesSent0=52340, bytesRcvd0=7812, sndSchedTime=1528334304232,
lastSndTime=1528537460337, lastRcvTime=1528537460337, readsPaused=false,
filterChain=FilterChain[filters=[GridNioCodecFilter
[parser=o.a.i.i.util.nio.GridDirectParser@7d2bc944, directMode=true],
GridConnectionBytesVerifyFilter], accepted=false]]
====================================================

One of the active nodes reported that the other node (above) had failed
immediately, at 09:44 (as shown in the logs below):

Jun 09, 2018 9:43:50 AM org.apache.ignite.logger.java.JavaLogger info
INFO:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=b392f3c6, name=delivery, uptime=56:18:06.778]
    ^-- H/N/C [hosts=9, nodes=9, CPUs=18]
    ^-- CPU [cur=9.17%, avg=9%, GC=0%]
    ^-- PageMemory [pages=41038]
    ^-- Heap [used=3167MB, free=22.42%, comm=4082MB]
    ^-- Non heap [used=202MB, free=-1%, comm=210MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=5, qSize=0]
    ^-- Outbound messages queue [size=0]
Jun 09, 2018 9:43:50 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=9096, reusePages=4659]
Jun 09, 2018 9:43:50 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=240]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=9bbe0362-00e0-4f16-abf0-52de5c62c51d, addrs=[10.40.173.21, 127.0.0.1],
sockAddrs=[/10.40.173.21:47500, /127.0.0.1:47500], discPort=47500, order=28,
intOrder=18, lastExchangeTime=1528334738846, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=40, servers=8, clients=0, CPUs=16, heap=26.0GB]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Started exchange init [topVer=AffinityTopologyVersion [topVer=40,
minorTopVer=0], crd=false, evt=NODE_FAILED,
evtNode=9bbe0362-00e0-4f16-abf0-52de5c62c51d, customEvt=null,
allowMerge=true]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished waiting for partition release future
[topVer=AffinityTopologyVersion [topVer=40, minorTopVer=0], waitTime=0ms,
futInfo=NA]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished exchange init [topVer=AffinityTopologyVersion [topVer=40,
minorTopVer=0], crd=false]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Received full message, will finish exchange
[node=9165f32c-9765-49d7-8856-5b77b0bded6d, resVer=AffinityTopologyVersion
[topVer=40, minorTopVer=0]]
Jun 09, 2018 9:44:22 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Finish exchange future [startVer=AffinityTopologyVersion [topVer=40,
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=40, minorTopVer=0],
err=null]
Jun 09, 2018 9:44:22 AM org.apache.ignite.logger.java.JavaLogger info


================================================================

Any advice on this issue would be appreciated.

-Naresh




--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by VeenaMithare <v....@cmcmarkets.com>.
Hi Evgenii,

Thank you for the reply and suggestion.

regards,
Veena.



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by Evgenii Zhuravlev <e....@gmail.com>.
Hi,

A segmentation plugin won't help with the issue itself. If you have a long GC
pause, the node is unresponsive for all that time. If a GC pause is longer than
10 seconds, the node will be dropped from the cluster (by default). If you have
long GC pauses, your load is probably too big for your configuration, and it
makes sense to increase the heap or add more nodes to the cluster.
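
For reference, the 10-second default mentioned above corresponds to
failureDetectionTimeout on IgniteConfiguration. A minimal sketch of raising it
(the 30-second value is only an illustration, not a recommendation):
========
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class NodeWithLongerFailureDetection {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Default is 10_000 ms: a node that is unresponsive for longer than
        // this is dropped from the cluster. Raising it only masks long pauses.
        cfg.setFailureDetectionTimeout(30_000);

        Ignition.start(cfg);
    }
}
========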

Evgenii

Tue, Apr 14, 2020 at 03:50, VeenaMithare <v....@cmcmarkets.com>:

> Hi Dmitry,
>
> Would having a segmentation plugin help to resolve segmentation due to GC
> pauses?
>
> Or is the best resolution for long GC pauses to resolve them and get the GC
> pauses within the failure detection timeout?
>
> regards,
> Veena.
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>

RE: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by VeenaMithare <v....@cmcmarkets.com>.
Hi Dmitry,

Would having a segmentation plugin help to resolve segmentation due to GC
pauses?

Or is the best resolution for long GC pauses to resolve them and get the GC
pauses within the failure detection timeout?

regards,
Veena.




--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

RE: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by dkarachentsev <dk...@gridgain.com>.
Naresh,

GC logs show not only GC pauses but system pauses as well. Try these
parameters:
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime 
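
A sketch of how these flags might be combined on one JVM options line (the
-Xloggc path is just an example, not an existing location in this setup):
========
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/ignite/gc.log
========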

Thanks!
-Dmitry



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

RE: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by "naresh.goty" <gn...@gmail.com>.
Thanks for the recommendation, but we have already identified and addressed the
issues with GC pauses in the JVM, and now we cannot find any long GC activity
during the time of the node failure due to network segmentation (please find
the attached screenshot of GC activity from the Dynatrace agent).

From the screenshot, there is only young-generation GC collection, and that is
< 100 ms.

We can still enable GC logs, but I strongly suspect the issue is beyond JVM
pauses.

Thanks,
Naresh

<http://apache-ignite-users.70518.x6.nabble.com/file/t1286/gc_activity.png> 



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

RE: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by dkarachentsev <dk...@gridgain.com>.
Hi Naresh,

Actually, any JVM process hang could lead to segmentation. If a node is not
responsive for longer than failureDetectionTimeout, it will be kicked out of
the cluster to prevent grid-wide performance degradation.

It works in the following way. Let's say we have 3 nodes in a ring: n1 ->
n2 -> n3. Discovery messages go over the ring, along with metrics and
connection checks, at a predefined interval. Node 2 starts experiencing issues
such as a GC pause or an OS failure that forces the process to stop. During that
time node 1 is unable to send messages to n2 (it doesn't receive an ack). n1
waits for failureDetectionTimeout and establishes a connection to n3 instead:
n1 -> n3, while n2 is no longer connected.

The cluster treats n2 as failed. When n2 comes back, it tries to connect to n3
and send a message across the ring, at which point it receives a message saying
it is out of the grid. For n2 that means it was segmented, and the best thing it
can do is stop.

To check whether there were large JVM or system pauses, you may enable GC logs.
If the pauses are longer than failureDetectionTimeout, the node will be segmented.

The best way would be to fix the pauses, but as a workaround you can increase the
timeout.
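
If the global failureDetectionTimeout is left untouched, the discovery-level
timeouts can be raised explicitly instead; a rough sketch with illustrative
values (note that, as far as I know, setting these explicitly makes discovery
ignore failureDetectionTimeout):
========
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

public class NodeWithExplicitDiscoveryTimeouts {
    public static void main(String[] args) {
        TcpDiscoverySpi discoSpi = new TcpDiscoverySpi();
        // How long to wait for a discovery message acknowledgement
        // before retrying with an increased timeout.
        discoSpi.setAckTimeout(10_000);
        // Socket operation timeout for discovery connections.
        discoSpi.setSocketTimeout(10_000);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(discoSpi);

        Ignition.start(cfg);
    }
}
========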

Thanks!
-Dmitry



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

RE: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by "naresh.goty" <gn...@gmail.com>.
Hi Dmitry,

We are again seeing a segmentation failure on one of the nodes in our prod
environment. This time we did not run jmap, but the node still failed.

-> CPU, memory utilization, and the network are all in an optimal state.

We observed page faults in memory at the same time as the segmentation failure,
as reported by the Dynatrace agent (attached screenshot).

Can you please confirm whether page faults could result in network segmentation
on a node?
We do see page faults on nodes, but they do not always result in a segmentation
failure.
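
As a side note, one way to check whether major page faults line up with the
failure window, assuming the sysstat package is installed (a sketch, adjust the
interval and count as needed):
========
# Report system-wide paging statistics (including majflt/s) every second, 60 samples:
sar -B 1 60
========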


Logs from Failed Agent:
================================
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=0]
Jun 23, 2018 8:40:00 PM org.apache.ignite.logger.java.JavaLogger info
INFO:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=3f568bb8, name=delivery, uptime=24:31:12.859]
    ^-- H/N/C [hosts=9, nodes=9, CPUs=18]
    ^-- CPU [cur=7%, avg=9.06%, GC=0%]
    ^-- PageMemory [pages=30244]
    ^-- Heap [used=3184MB, free=22.09%, comm=4087MB]
    ^-- Non heap [used=213MB, free=-1%, comm=222MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=5, qSize=0]
    ^-- Outbound messages queue [size=0]
Jun 23, 2018 8:40:00 PM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4879, reusePages=0]
Jun 23, 2018 8:40:00 PM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=0]
Jun 23, 2018 8:40:34 PM org.apache.ignite.logger.java.JavaLogger info
INFO: TCP discovery accepted incoming connection [rmtAddr=/10.40.173.14,
rmtPort=33762]
Jun 23, 2018 8:40:34 PM org.apache.ignite.logger.java.JavaLogger info
INFO: TCP discovery spawning a new thread for connection
[rmtAddr=/10.40.173.14, rmtPort=33762]
Jun 23, 2018 8:40:34 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Started serving remote node connection [rmtAddr=/10.40.173.14:33762,
rmtPort=33762]
Jun 23, 2018 8:40:34 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node is out of topology (probably, due to short-time network
problems).
Jun 23, 2018 8:40:34 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Local node SEGMENTED: TcpDiscoveryNode
[id=3f568bb8-813d-47f7-b8da-4ecbff3e9753, addrs=[10.40.173.78, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.78:47500], discPort=47500, order=54,
intOrder=32, lastExchangeTime=1529786434361, loc=true,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished serving remote node connection [rmtAddr=/10.40.173.14:33762,
rmtPort=33762
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished serving remote node connection [rmtAddr=/10.40.173.41:52584,
rmtPort=52584
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Stopping local node according to configured segmentation policy.
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=9165f32c-9765-49d7-8856-5b77b0bded6d, addrs=[10.40.173.14, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.14:47500], discPort=47500, order=22,
intOrder=15, lastExchangeTime=1529050123714,
loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Command protocol successfully stopped: TCP binary
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=56, servers=8, clients=0, CPUs=16, heap=28.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=a26de809-dde1-41b8-87a3-d5576851a0be, addrs=[10.40.173.56, 127.0.0.1],
sockAddrs=[/10.40.173.56:47500, /127.0.0.1:47500], discPort=47500, order=23,
intOrder=16, lastExchangeTime=1529050123735,
loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=57, servers=7, clients=0, CPUs=14, heap=26.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=910ea19f-af5c-4745-a035-b24a3bb48206, addrs=[10.40.173.88, 127.0.0.1],
sockAddrs=[/10.40.173.88:47500, /127.0.0.1:47500], discPort=47500, order=25,
intOrder=17, lastExchangeTime=1529050123735,
loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=58, servers=6, clients=0, CPUs=12, heap=24.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=17f3ba9c-e32e-47e4-9ca2-136338d8c4ac, addrs=[10.40.173.39, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.39:47500], discPort=47500, order=30,
intOrder=19, lastExchangeTime=1529050123735, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=59, servers=5, clients=0, CPUs=10, heap=20.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=b392f3c6-84fd-4cd9-a695-92d1ef3b4262, addrs=[10.40.173.11, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.11:47500], discPort=47500, order=34,
intOrder=21, lastExchangeTime=1529050123735, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=60, servers=4, clients=0, CPUs=8, heap=16.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=e2eb6d96-e60e-4643-ac7a-2b750888079e, addrs=[10.40.173.21, 127.0.0.1],
sockAddrs=[/10.40.173.21:47500, /127.0.0.1:47500], discPort=47500, order=41,
intOrder=25, lastExchangeTime=1529050123735, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=61, servers=3, clients=0, CPUs=6, heap=12.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=8975781d-ac95-49eb-9f17-4be2d3374b15, addrs=[10.40.173.74, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.74:47500], discPort=47500, order=45,
intOrder=27, lastExchangeTime=1529050123735, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=62, servers=2, clients=0, CPUs=4, heap=8.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning


==================================


Logs from Coordinator node (or reported agent)

Jun 23, 2018 8:39:18 PM org.apache.ignite.logger.java.JavaLogger info
INFO:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=96268498, name=delivery, uptime=16:54:44.560]
    ^-- H/N/C [hosts=9, nodes=9, CPUs=18]
    ^-- CPU [cur=6.5%, avg=9.07%, GC=0%]
    ^-- PageMemory [pages=33192]
    ^-- Heap [used=3396MB, free=16.88%, comm=4086MB]
    ^-- Non heap [used=219MB, free=-1%, comm=228MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=5, qSize=0]
    ^-- Outbound messages queue [size=0]
Jun 23, 2018 8:39:18 PM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=5624, reusePages=63]
Jun 23, 2018 8:39:18 PM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=0]
Jun 23, 2018 8:40:18 PM org.apache.ignite.logger.java.JavaLogger info
INFO:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=96268498, name=delivery, uptime=16:55:44.622]
    ^-- H/N/C [hosts=9, nodes=9, CPUs=18]
    ^-- CPU [cur=7.83%, avg=9.07%, GC=0%]
    ^-- PageMemory [pages=33192]
    ^-- Heap [used=3188MB, free=21.98%, comm=4086MB]
    ^-- Non heap [used=219MB, free=-1%, comm=228MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=5, qSize=0]
    ^-- Outbound messages queue [size=0]
Jun 23, 2018 8:40:18 PM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=5624, reusePages=63]
Jun 23, 2018 8:40:18 PM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=0]
Jun 23, 2018 8:40:54 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Timed out waiting for message delivery receipt (most probably, the
reason is in long GC pauses on remote node; consider tuning GC and
increasing 'ackTimeout' configuration property). Will retry to send message
with increased timeout [currentTimeout=9990, rmtAddr=/10.40.173.78:47500,
rmtPort=47500]
Jun 23, 2018 8:40:54 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Failed to send message to next node
[msg=TcpDiscoveryMetricsUpdateMessage [super=TcpDiscoveryAbstractMessage
[sndNodeId=8975781d-ac95-49eb-9f17-4be2d3374b15,
id=bd526e6e361-9165f32c-9765-49d7-8856-5b77b0bded6d,
verifierNodeId=9165f32c-9765-49d7-8856-5b77b0bded6d, topVer=0, pendingIdx=0,
failedNodes=null, isClient=false]], next=TcpDiscoveryNode
[id=3f568bb8-813d-47f7-b8da-4ecbff3e9753, addrs=[10.40.173.78, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.78:47500], discPort=47500, order=54,
intOrder=32, lastExchangeTime=1529050142836, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], errMsg=Failed to send
message to next node [msg=TcpDiscoveryMetricsUpdateMessage
[super=TcpDiscoveryAbstractMessage
[sndNodeId=8975781d-ac95-49eb-9f17-4be2d3374b15,
id=bd526e6e361-9165f32c-9765-49d7-8856-5b77b0bded6d,
verifierNodeId=9165f32c-9765-49d7-8856-5b77b0bded6d, topVer=0, pendingIdx=0,
failedNodes=null, isClient=false]], next=ClusterNode
[id=3f568bb8-813d-47f7-b8da-4ecbff3e9753, order=54, addr=[10.40.173.78,
127.0.0.1], daemon=false]]]
Jun 23, 2018 8:40:54 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Local node has detected failed nodes and started cluster-wide
procedure. To speed up failure detection please see 'Failure Detection'
section under javadoc for 'TcpDiscoverySpi'
Jun 23, 2018 8:40:55 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=3f568bb8-813d-47f7-b8da-4ecbff3e9753, addrs=[10.40.173.78, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.78:47500], discPort=47500, order=54,
intOrder=32, lastExchangeTime=1529050142836, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:55 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=56, servers=8, clients=0, CPUs=16, heap=26.0GB]
Jun 23, 2018 8:40:55 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Started exchange init [topVer=AffinityTopologyVersion [topVer=56,
minorTopVer=0], crd=false, evt=NODE_FAILED,
evtNode=3f568bb8-813d-47f7-b8da-4ecbff3e9753, customEvt=null,
allowMerge=true]
Jun 23, 2018 8:40:55 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished waiting for partition release future
[topVer=AffinityTopologyVersion [topVer=56, minorTopVer=0], waitTime=0ms,
futInfo=NA]
Jun 23, 2018 8:40:55 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished exchange init [topVer=AffinityTopologyVersion [topVer=56,
minorTopVer=0], crd=false]
Jun 23, 2018 8:40:55 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Received full message, will finish exchange
[node=9165f32c-9765-49d7-8856-5b77b0bded6d, resVer=AffinityTopologyVersion
[topVer=56, minorTopVer=0]]
===================================================

<http://apache-ignite-users.70518.x6.nabble.com/file/t1286/pageFaults.png> 

Thanks
Naresh



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

RE: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by "naresh.goty" <gn...@gmail.com>.
Thanks Dmitry



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

RE: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by dkarachentsev <dk...@gridgain.com>.
Hi Naresh,

The recommendation will be the same: increase failureDetectionTimeout until the
nodes stop segmenting, or use gdb (or remove the "live" option from the jmap
command to skip the full GC).
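
For reference, a sketch of the two jmap variants (the file path and PID are
placeholders):
========
# With "live": forces a full GC first, so only live objects are dumped.
jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>

# Without "live": skips the forced full GC before the dump.
jmap -dump:format=b,file=/tmp/heap.hprof <pid>
========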

Thanks!
-Dmitry



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

RE: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by "naresh.goty" <gn...@gmail.com>.
thanks Stan.

We have enabled actionable alerts to generate dumps when memory utilization
reaches a certain threshold on all cache nodes. Whenever the alert is triggered,
the cache node gets segmented. So, essentially, we cannot take dumps on a live
node. Even increasing the socket timeout may not work, as the heap dump process
takes minutes to complete and the node is unresponsive during that time. How can
we get around this issue? (We could not evaluate gdb yet.)
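
If the dump has to be triggered from inside the alert hook rather than via jmap,
one option is the HotSpotDiagnosticMXBean with live=false, so no full GC is
forced; this is only a sketch, and the node will still pause while the dump file
is written:
========
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class HeapDumper {
    public static void dump(String path) throws Exception {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
            ManagementFactory.getPlatformMBeanServer(),
            "com.sun.management:type=HotSpotDiagnostic",
            HotSpotDiagnosticMXBean.class);

        // live = false: do not force a full GC before dumping.
        bean.dumpHeap(path, false);
    }

    public static void main(String[] args) throws Exception {
        dump("/tmp/heap.hprof"); // the path is just an example
    }
}
========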

Regards,
Naresh





--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

RE: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by Stanislav Lukyanov <st...@gmail.com>.
Quick googling suggests using gdb:
https://stackoverflow.com/a/37375351/4153863

Again, Ignite doesn’t play any role here – whatever works for any Java application should work for Ignite as well.
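
A sketch of the gdb-based approach from that link (paths and PID are
placeholders; the second step uses the JDK 8 jmap syntax for reading a core file):
========
# Write a core dump of the running JVM; the process is paused only while the
# core is written, with no forced full GC.
gcore -o /tmp/java-core <pid>

# Convert the core into a heap dump offline, away from the live node.
jmap -dump:format=b,file=/tmp/heap.hprof /usr/bin/java /tmp/java-core.<pid>
========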

Stan

From: naresh.goty
Sent: June 11, 2018, 20:56
To: user@ignite.apache.org
Subject: Re: Ignite Node failure - Node out of topology (SEGMENTED)

Hi All,

We found that when the jmap utility is triggered to generate heap dumps on the
application node, a NODE_SEGMENTATION event is fired from the node. Can someone
please let us know how to safely take heap dumps on a live node, with the Ignite
cache running in embedded mode, without crashing the node due to a segmentation
failure?

Regards,
Naresh



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by "naresh.goty" <gn...@gmail.com>.
Hi All,

We found that when the jmap utility is triggered to generate heap dumps on the
application node, a NODE_SEGMENTATION event is fired from the node. Can someone
please let us know how to safely take heap dumps on a live node, with the Ignite
cache running in embedded mode, without crashing the node due to a segmentation
failure?

Regards,
Naresh



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by "naresh.goty" <gn...@gmail.com>.
We are still seeing the NODE SEGMENTATION issue happening to one of the nodes in
our production environment even after the JVM option is enabled
(-Djava.net.preferIPv4Stack=true).

We don't see any activity reported in the logs for a period of ~30 min after the
node failed.
The logs below are from the failed node. It can be observed that the node was up
and running until this timestamp (9:43:25), and after that there are no messages
until 10:19:59 in the catalina logs. Though the node failed at 09:44, the node
segmentation was reported at 10:19 AM.

Right before the node failure, the metrics clearly showed that CPU and memory
utilization were very low. Long GC pauses were also not an issue.
=================================================
Jun 09, 2018 9:42:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=10239,
reusePages=6228]
Jun 09, 2018 9:42:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=278]
Jun 09, 2018 9:43:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=9bbe0362, name=delivery, uptime=56:24:59.907]
    ^-- H/N/C [hosts=9, nodes=9, CPUs=18]
    ^-- CPU [cur=1.67%, avg=2.71%, GC=0%]
    ^-- PageMemory [pages=43003]
    ^-- Heap [used=605MB, free=85.11%, comm=4065MB]
    ^-- Non heap [used=204MB, free=-1%, comm=213MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=6, qSize=0]
    ^-- Outbound messages queue [size=0]
Jun 09, 2018 9:43:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=10239,
reusePages=6228]
Jun 09, 2018 9:43:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=278]
2018-06-09 10:19:59 [a9fc3882] warning [java  ] ... last message repeated 1
time ...
2018-06-09 10:19:59 [a9fc3882] severe  [native] Exception in controller:
receiveExact() ... error reading, 70014, End of file found. Retrying every
10 seconds.
2018-06-09 10:19:59 [abff8882] warning [native] Instrumentation channel
disconnected: server did not reply to ping request
Jun 09, 2018 10:19:59 AM org.apache.ignite.logger.java.JavaLogger error
SEVERE: Failed to send message: class o.a.i.IgniteCheckedException: Failed
to send message (connection was closed): GridSelectorNioSessionImpl
[worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=0,
bytesRcvd=24290335228, bytesSent=51080178562, bytesRcvd0=25800,
bytesSent0=183862, select=true, super=GridWorker
[name=grid-nio-worker-tcp-comm-0, igniteInstanceName=delivery,
finished=false, hashCode=1045175627, interrupted=false,
runner=grid-nio-worker-tcp-comm-0-#25%delivery%]]],
writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
inRecovery=GridNioRecoveryDescriptor [acked=3240064, resendCnt=0,
rcvCnt=3240098, sentCnt=3240092, reserved=false, lastAck=3240096,
nodeLeft=false, node=TcpDiscoveryNode
[id=9165f32c-9765-49d7-8856-5b77b0bded6d, addrs=[10.40.173.14, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.14:47500], discPort=47500, order=22,
intOrder=15, lastExchangeTime=1528334301578, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], connected=false,
connectCnt=1, queueLimit=131072, reserveCnt=1, pairedConnections=false],
outRecovery=GridNioRecoveryDescriptor [acked=3240064, resendCnt=0,
rcvCnt=3240098, sentCnt=3240092, reserved=false, lastAck=3240096,
nodeLeft=false, node=TcpDiscoveryNode
[id=9165f32c-9765-49d7-8856-5b77b0bded6d, addrs=[10.40.173.14, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.14:47500], discPort=47500, order=22,
intOrder=15, lastExchangeTime=1528334301578, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], connected=false,
connectCnt=1, queueLimit=131072, reserveCnt=1, pairedConnections=false],
super=GridNioSessionImpl [locAddr=/10.40.173.21:50320,
rmtAddr=/10.40.173.14:47100, createTime=1528334304232,
closeTime=1528537460337, bytesSent=13308284327, bytesRcvd=12937081481,
bytesSent0=52340, bytesRcvd0=7812, sndSchedTime=1528334304232,
lastSndTime=1528537460337, lastRcvTime=1528537460337, readsPaused=false,
filterChain=FilterChain[filters=[GridNioCodecFilter
[parser=o.a.i.i.util.nio.GridDirectParser@7d2bc944, directMode=true],
GridConnectionBytesVerifyFilter], accepted=false]]
====================================================

One of the active nodes reported that the other node (above) had failed
immediately, at 09:44 (as shown in the logs below):

Jun 09, 2018 9:43:50 AM org.apache.ignite.logger.java.JavaLogger info
INFO:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=b392f3c6, name=delivery, uptime=56:18:06.778]
    ^-- H/N/C [hosts=9, nodes=9, CPUs=18]
    ^-- CPU [cur=9.17%, avg=9%, GC=0%]
    ^-- PageMemory [pages=41038]
    ^-- Heap [used=3167MB, free=22.42%, comm=4082MB]
    ^-- Non heap [used=202MB, free=-1%, comm=210MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=5, qSize=0]
    ^-- Outbound messages queue [size=0]
Jun 09, 2018 9:43:50 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=9096, reusePages=4659]
Jun 09, 2018 9:43:50 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=240]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=9bbe0362-00e0-4f16-abf0-52de5c62c51d, addrs=[10.40.173.21, 127.0.0.1],
sockAddrs=[/10.40.173.21:47500, /127.0.0.1:47500], discPort=47500, order=28,
intOrder=18, lastExchangeTime=1528334738846, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=40, servers=8, clients=0, CPUs=16, heap=26.0GB]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Started exchange init [topVer=AffinityTopologyVersion [topVer=40,
minorTopVer=0], crd=false, evt=NODE_FAILED,
evtNode=9bbe0362-00e0-4f16-abf0-52de5c62c51d, customEvt=null,
allowMerge=true]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished waiting for partition release future
[topVer=AffinityTopologyVersion [topVer=40, minorTopVer=0], waitTime=0ms,
futInfo=NA]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished exchange init [topVer=AffinityTopologyVersion [topVer=40,
minorTopVer=0], crd=false]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Received full message, will finish exchange
[node=9165f32c-9765-49d7-8856-5b77b0bded6d, resVer=AffinityTopologyVersion
[topVer=40, minorTopVer=0]]
Jun 09, 2018 9:44:22 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Finish exchange future [startVer=AffinityTopologyVersion [topVer=40,
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=40, minorTopVer=0],
err=null]
Jun 09, 2018 9:44:22 AM org.apache.ignite.logger.java.JavaLogger info


================================================================

Any advice on this issue would be appreciated.

-Naresh




--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Ignite Node failure - Node out of topology (SEGMENTED)

Posted by Andrey Mashenkov <an...@gmail.com>.
Hi,

It seems there is a bug related to IPv6 usage [1]; this has to be investigated.
There is also a discussion [2].

[1] https://issues.apache.org/jira/browse/IGNITE-6503
[2]
http://apache-ignite-developers.2346864.n4.nabble.com/Issues-if-Djava-net-preferIPv4Stack-true-is-not-set-td22372.html
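
For completeness, since the catalina logs mentioned earlier suggest the nodes run
under Tomcat, the flag can be made permanent in Tomcat's setenv script; a sketch
under that assumption:
========
# $CATALINA_BASE/bin/setenv.sh
export CATALINA_OPTS="$CATALINA_OPTS -Djava.net.preferIPv4Stack=true"
========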

On Wed, Jun 6, 2018 at 9:24 PM, naresh.goty <gn...@gmail.com> wrote:

> Thanks.
> We have enabled the IPv4 JVM option in our non-production environment and found
> no segmentation issues reported there. Our main concern is that the issue is
> happening only in production, and we are very interested in finding the real
> root cause (we can rule out GC pauses, CPU spikes, and network latency; none of
> them is the cause).
>
> 1) Please provide any useful tips for identifying the source of the problem, so
> that we can avoid it altogether instead of taking remediation steps (restarting
> the JVM) whenever the issue happens.
>
> 2) Do let us know whether any timeout configurations should be increased to
> mitigate the problem.
>
> Regards
> Naresh
>
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>



-- 
Best regards,
Andrey V. Mashenkov