Posted to user@ignite.apache.org by xero <mp...@gmail.com> on 2020/10/27 19:19:43 UTC

client hangs forever trying to join the cluster (ClientImp joinLatch.await())

Hello,
We recently had a production incident in which our application got stuck
connecting to the cluster. The IgnitionEx.start0 method was blocked for
more than 24 hours waiting for the join latch to be notified, but that never
happened. In the end, the container had to be restarted to recover the
service.
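
For reference, the client node is started roughly like this (a simplified
sketch; the IP finder address and class/variable names other than the Ignite
API itself are illustrative, not our exact configuration):

    import java.util.Collections;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
    import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

    public class ClientStart {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setClientMode(true);

            // Point discovery at the server node(s); the address is just an example.
            TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
            ipFinder.setAddresses(Collections.singletonList("10.133.3.6:47500"));

            TcpDiscoverySpi disco = new TcpDiscoverySpi();
            disco.setIpFinder(ipFinder);
            cfg.setDiscoverySpi(disco);

            // This call goes through IgnitionEx.start0() and, for a client node,
            // waits on the discovery join latch; in our incident it never returned.
            Ignite ignite = Ignition.start(cfg);

            System.out.println("Joined cluster, topology version: "
                + ignite.cluster().topologyVersion());
        }
    }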

This is the stack trace of that thread:
<http://apache-ignite-users.70518.x6.nabble.com/file/t1923/Screen_Shot_2020-10-27_at_3.png> 


This happened around the time an Ignite server node restarted due to
SEGMENTATION. These are some lines I extracted from that server's logs that
may be relevant (not sure, though):

2020-10-22T13:33:03.348+00:00 a5912bf99152 ignite:
tcp-disco-msg-worker-#2|WARN |o.a.i.s.d.tcp.TcpDiscoverySpi|Node is out of
topology (probably, due to short-time network problems).

2020-10-22T13:33:03.349+00:00 a5912bf99152 ignite:
disco-event-worker-#66|WARN |o.a.i.i.m.d.GridDiscoveryManager|Local node
SEGMENTED: TcpDiscoveryNode [id=2296e9a7-96d6-44d9-af3b-4e22e33261ea,
addrs=[10.133.3.6, 127.0.0.1], sockAddrs=[/127.0.0.1:47500,
a5912bf99152/10.133.3.6:47500], discPort=47500, order=276, intOrder=142,
lastExchangeTime=1603373583342, loc=true, ver=2.7.6#20190911-sha1:21f7ca41,
isClient=false]

2020-10-22T13:33:04.232+00:00 a5912bf99152 ignite:
node-stopper|ERROR|ROOT|Stopping local node on Ignite failure:
[failureCtx=FailureContext [type=SEGMENTATION, err=null]]

2020-10-22T13:33:09.312+00:00 a5912bf99152 ignite: exchange-worker-#67|INFO
|o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture|Coordinator changed, send
partitions to new coordinator [ver=AffinityTopologyVersion [topVer=284,
minorTopVer=0], crd=6293444a-0f6d-4946-b357-85a6d195a244,
newCrd=ad701f62-28ee-4028-8981-8a19dd5de1f8]

2020-10-22T13:33:09.313+00:00 a5912bf99152 ignite: exchange-worker-#67|INFO
|o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture|Coordinator failed, node
is new coordinator [ver=AffinityTopologyVersion [topVer=284, minorTopVer=0],
prev=ad701f62-28ee-4028-8981-8a19dd5de1f8
]


During those 24 hours there were hundreds of messages about
SYSTEM_WORKER_BLOCKED, but that failure type is ignored by the failure handler:

2020-10-22 06:33:12.732 PDT [grid-timeout-worker-#119]  ERROR root
-Critical system error detected. Will be handled accordingly to configured
handler [hnd=ExpressIgnitionFailureHandler [], failureCtx=FailureContext
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
[name=tcp-client-disco-msg-worker, igniteInstanceName=null, finished=false,
heartbeatTs=1603373580648]]]


Based on the logs, it seems there was a network glitch during that interval,
at the same time the client was trying to join the cluster.
Do you think these events could be related to the blocked start0 method? Is it
possible that the glitch/coordinator change caused the join request/response
to get lost, making that latch block forever?
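
As a possible workaround (just an idea on our side, not something we have
verified), would setting a join timeout on the discovery SPI make the client
fail fast instead of waiting forever? Something along these lines, where cfg
is the client's IgniteConfiguration (values are illustrative, not tuned):

    // Bound the join phase so a lost join request/response surfaces as a
    // startup failure instead of an indefinite hang.
    TcpDiscoverySpi disco = new TcpDiscoverySpi();  // IP finder setup omitted here
    disco.setJoinTimeout(60_000);     // give up joining after 60 s (default 0 = wait forever)
    disco.setNetworkTimeout(10_000);  // timeout for individual network operations
    cfg.setDiscoverySpi(disco);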

Any suggestions for handling this case? (Is there any change in 2.8.1 or 2.9
that may apply?)
Thanks for your time.

Re: client hangs forever trying to join the cluster (ClientImp joinLatch.await())

Posted by akorensh <al...@gmail.com>.
The latch not being notified is a bit odd. If this happens regularly, send
the steps to reproduce.
It might be network related.

See if you can monitor the network to determine whether all hosts have
continuous access to each other.
If you see gaps, adjust failureDetectionTimeout (or other network settings)
as needed.
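
For example, something along these lines in the IgniteConfiguration on both
servers and clients (the values below are only placeholders to illustrate the
setters; tune them to your network):

    // Larger values tolerate longer network glitches at the cost of slower
    // failure detection. cfg is the IgniteConfiguration passed to Ignition.start().
    cfg.setFailureDetectionTimeout(30_000);        // server <-> server, default 10_000 ms
    cfg.setClientFailureDetectionTimeout(60_000);  // server <-> client, default 30_000 ms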





Re: client hangs forever trying to join the cluster (ClientImp joinLatch.await())

Posted by xero <mp...@gmail.com>.
Hi, 
Thanks for the recommendations.

In this case, neither the server nor the client showed memory issues (heap and
available memory in the container), and the GC pauses were very short too.

The configured timeouts are the defaults:
clientFailureDetectionTimeout = 30000
failureDetectionTimeout = 10000

The latch did not get notified for more than 24 hours, yet the timeout is 30
seconds. How can this explain the node hanging for a day? That's why I was
thinking a message got lost.

Do you think using different values for those parameters would avoid this
scenario?




Re: client hangs forever trying to join the cluster (ClientImp joinLatch.await())

Posted by akorensh <al...@gmail.com>.
Hi,
  It does look like network issues.
  You might want to adjust the network timeouts to fit your use case:
https://ignite.apache.org/docs/latest/clustering/network-configuration#connection-timeouts
  The main setting is IgniteConfiguration.failureDetectionTimeout.

  More useful details here:
https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/spi/discovery/tcp/TcpDiscoverySpi.html


  This might also be caused by GC issues. Check how much memory you've
allocated to your process, collect GC logs, and see what they say.
 
https://ignite.apache.org/docs/latest/perf-and-troubleshooting/troubleshooting#detailed-gc-logs
 
https://ignite.apache.org/docs/latest/perf-and-troubleshooting/memory-tuning
  

  Information on the critical worker blocked message:
 
https://ignite.apache.org/docs/latest/perf-and-troubleshooting/handling-exceptions#failures-handling
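
  For instance, if you want a blocked critical worker (like the
tcp-client-disco-msg-worker in your log) to actually trigger the failure
handler instead of only being logged, you can clear the ignored failure types
(a sketch assuming the 2.8+ configuration API; adapt it to your own handler):

    // By default SYSTEM_WORKER_BLOCKED and SYSTEM_CRITICAL_OPERATION_TIMEOUT
    // are ignored; with an empty set the handler reacts to them as well.
    // cfg is your IgniteConfiguration.
    StopNodeOrHaltFailureHandler hnd = new StopNodeOrHaltFailureHandler();
    hnd.setIgnoredFailureTypes(java.util.Collections.emptySet());
    cfg.setFailureHandler(hnd);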

Thanks, Alex


