Posted to user@ignite.apache.org by yucigou <yu...@gmail.com> on 2017/02/02 16:39:22 UTC

Re: Local node seems to be disconnected from topology (failure detection timeout is reached)

A follow-up question about recovery.

Node ves-hx-40 was frozen for about a minute due to a VM backup, and was
considered failed by the cluster.

Then ves-hx-40 woke up after the VM backup and found itself disconnected
from the topology (see the logs below). It then stopped itself.

[23:34:03,752][INFO ][tcp-disco-msg-worker-#2%null%][TcpDiscoverySpi] Local
node seems to be disconnected from topology (failure detection timeout is
reached) [failureDetectionTimeout=10000, connCheckFreq=3333] 
[23:34:03,783][WARN ][tcp-disco-msg-worker-#2%null%][TcpDiscoverySpi] Node
is out of topology (probably, due to short-time network problems). 
[23:34:03,786][WARN ][disco-event-worker-#44%null%][GridDiscoveryManager]
Local node SEGMENTED: TcpDiscoveryNode
[id=9a069f70-d49d-472e-9771-7ac2353e751f, addrs=[10.3.0.64, 127.0.0.1],
sockAddrs=[ves-hx-40.ebi.ac.uk/10.3.0.64:47500, /10.3.0.64:47500,
/127.0.0.1:47500], discPort=47500, order=56, intOrder=29,
lastExchangeTime=1470350043783, loc=true, ver=1.6.0#20160518-sha1:0b22c45b,
isClient=false] 
[23:34:03,819][WARN ][disco-event-worker-#44%null%][GridDiscoveryManager]
Stopping local node according to configured segmentation policy. 

I understand that in such situations Apache Ignite would stop the local node
according to the segmentation policy.

My question is: why does Apache Ignite not offer an option to try to
reconnect to the cluster, instead of just stopping the local node (or doing
nothing, or restarting the JVM)?

I think it is a reasonable policy option, that is, to regard the
disconnected local node as a new potential member of the cluster, clear all
of its local caches and states, and then rejoin the cluster.

Thanks.

Yuci



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Local-node-seems-to-be-disconnected-from-topology-failure-detection-timeout-is-reached-tp6797p10386.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.

Posted by vkulichenko <va...@gmail.com>.

Yuci,

The RESTART_JVM policy restarts the whole JVM (surprisingly :) ), so it
works only for standalone nodes started with the ignite.sh script.

In your case you can use the NOOP policy, listen for the EVT_NODE_SEGMENTED
event, and restart the node in a listener.
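
As a rough sketch of that listener approach (not tested against a real
cluster): the helper below is hypothetical and not part of Ignite. The names
SegmentedNodeRestarter, nodeFactory and onSegmented are made up for
illustration, and the node type is left generic so that the sketch compiles
with the plain JDK. It assumes the restart should run on a separate thread
rather than on the thread that delivers the event.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

/**
 * Hypothetical helper (not an Ignite class): replaces a segmented node
 * with a freshly started one. N stands in for the node type.
 */
final class SegmentedNodeRestarter<N extends AutoCloseable> {
    /** Starts a new node each time it is called (assumed to be supplied by the application). */
    private final Supplier<N> nodeFactory;

    /** Single daemon thread, so restarts run one at a time and off the caller's thread. */
    private final ExecutorService restartExecutor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "segmented-node-restarter");
        t.setDaemon(true);
        return t;
    });

    /** The node that is currently considered live. */
    private volatile N node;

    SegmentedNodeRestarter(Supplier<N> nodeFactory) {
        this.nodeFactory = nodeFactory;
        this.node = nodeFactory.get(); // start the first node
    }

    N currentNode() {
        return node;
    }

    /**
     * Intended to be called from the segmentation event listener.
     * Returns immediately; the returned Future completes once the
     * old node is stopped and the new one is started.
     */
    Future<?> onSegmented() {
        return restartExecutor.submit(() -> {
            try {
                node.close(); // stop the segmented node
            } catch (Exception e) {
                System.err.println("Failed to stop segmented node: " + e);
            }
            node = nodeFactory.get(); // join again as a brand-new node
        });
    }
}
```

With Ignite, the factory would presumably build an IgniteConfiguration with
setSegmentationPolicy(SegmentationPolicy.NOOP) and
setIncludeEventTypes(EventType.EVT_NODE_SEGMENTED), call Ignition.start(cfg),
and register a local listener for EVT_NODE_SEGMENTED through
ignite.events().localListen(...) that calls onSegmented(). Ignite implements
AutoCloseable, so close() stops the node.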

-Val




Posted by yucigou <yu...@gmail.com>.

Hi Val,

Thanks for explaining why server nodes are not allowed to reconnect.

Regarding the RESTART_JVM policy, this policy does not work for our web
application. The reason is that in our web application Ignite is started by
the servlet context listener
org.apache.ignite.startup.servlet.ServletContextListenerStartup.

And according to the Ignite documentation, the RESTART_JVM policy works
only if Ignite is started with CommandLineStartup via the standard
ignite.{sh|bat} shell script.

https://ignite.apache.org/releases/mobile/org/apache/ignite/plugin/segmentation/SegmentationPolicy.html#RESTART_JVM

I wonder whether it is possible to make the RESTART_JVM policy also work
when Ignite is started by the servlet context listener.

Thanks,
Yuci




Posted by vkulichenko <va...@gmail.com>.

Hi Yuci,

We do not allow reconnection of server nodes because it's dangerous for data
consistency. A segmented node must rejoin the topology as a new one. There
is a RESTART_JVM policy, which restarts the node immediately.

-Val


