You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@ignite.apache.org by crenique <cr...@gmail.com> on 2018/05/10 19:19:24 UTC

Node FAILED

Hi,

We are running ignite 2.3 grid in 40 nodes cluster.
1 ignite server, 2 ignite clients per each node.
so, total
 40 ignite servers
 80 ignite clients

Topology snapshot [ver=120, servers=40, clients=80, CPUs=640, heap=560.0GB]


It runs ok upto about 12 hours uptime, and total partitioned cache size 1
million.
But staring from the middle night, ignite server seems getting kicked out of
topology by "Node FAILED" message.


Can you please give me an insight what could be the cause ?
(attache the full log files, and details below)


Thanks
Sam

------------------------------

VM total memory: 10.0GB
VM arguments: [-Duser.timezone=UTC, -DIGNITE_QUIET=false,
-Djava.net.preferIPv4Stack=true, -Djava.awt.headless=true, -Xms10g, -Xmx10g,
-XX:+AlwaysPreTouch, -XX:+UseG1GC, -XX:+ScavengeBeforeFullGC,
-XX:+DisableExplicitGC]

ignite-8a2ce76e__0.log
<http://apache-ignite-users.70518.x6.nabble.com/file/t1784/ignite-8a2ce76e__0.log>  
ignite-8a2ce76e__1.log
<http://apache-ignite-users.70518.x6.nabble.com/file/t1784/ignite-8a2ce76e__1.log>  
ignite-config.xml
<http://apache-ignite-users.70518.x6.nabble.com/file/t1784/ignite-config.xml>  



CacheConfiguration cacheConfig = new CacheConfiguration(cacheName,
queryEntity)
{
	SqlSchema = "PUBLIC",
	Backups = 2,
	DataRegionName = "Default_Region",

	CopyOnRead = false,
	EvictionPolicy = new LruEvictionPolicy
	{
		MaxSize = 100000, // Maximum number of entries that will be stored in Java
heap. 
		MaxMemorySize = 1024 * 1024 * 1024 * 2 // (4 GB)
	}
};





--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Node FAILED

Posted by crenique <cr...@gmail.com>.

Ok, i'll try increasing failureDetectionTimeout to 3 - 5 mins.

For the heapdump, I tried jmap/jcmd to get a java heapdump for the frozen
ignite dotnet process, but it didn't respond and jconsole also failed to
attach.
Instead I got the dotnet dump. 
Maybe GC Finalizer callback somehow got into a deadlock ?

<http://apache-ignite-users.70518.x6.nabble.com/file/t1784/DotnetProcessDmp-node26.jpg> 





--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Node FAILED

Posted by akurbanov <an...@gmail.com>.

Hello,

Did you tried to use larger failureDetectionTimeout value? Try setting it to
3-5 minutes.

Is there a chance you could provide heap dump to check for memory consumers?



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Node FAILED

Posted by crenique <cr...@gmail.com>.

Thanks Anton for the information.
Here I re-summarized and added more details and both server and client logs
when the incident happened.


[Cluster configuration]

Windows Azure VM scale set
Windows Server 2016 10.0 amd64 VM x 40 nodes

VM information: Java(TM) SE Runtime Environment 1.8.0_162-b12 Oracle
Corporation Java HotSpot(TM) 64-Bit Server VM 25.162-b12

1 ignite server, 2 ignite clients per each node
Full topology is, 2018/05/09 17:53:30.564 [INFO
][disco-event-worker-#56][GridDiscoveryManager] Topology snapshot [ver=120,
servers=40, clients=80, CPUs=640, heap=560.0GB]


[Ignite cache configuration]

.NET ignite 2.3

ignite-config.xml
<http://apache-ignite-users.70518.x6.nabble.com/file/t1784/ignite-config.xml>  
var lifecycleHandler = new LifecycleHandler();
IgniteConfiguration igniteConfiguration = new IgniteConfiguration()
{
	SpringConfigUrl = "ignite-config.xml",
	ClientMode = clientMode,
	JvmOptions = jvmOptions,
	LifecycleHandlers = new[] { lifecycleHandler },
	BinaryConfiguration = binaryConfiguration
};

m_ignite = Ignition.Start(igniteConfiguration);
m_ignite.Stopping += async (sender, args) => 
{
	Console.WriteLine(">>> Ignite node stopping ...");  /<= No console log has
been printed/
};

m_ignite.Stopped += async (sender, args) =>
{
	Console.WriteLine(">>> Ignite node stopped."); /<= No console log has been
printed/
};


CacheConfiguration cacheConfig = new CacheConfiguration(cacheName,
queryEntity)
{
	SqlSchema = "PUBLIC",
	Backups = 2,
	DataRegionName = "Default_Region",

	CopyOnRead = false,
	EvictionPolicy = new LruEvictionPolicy
	{
		MaxSize = 100000,
		MaxMemorySize = 1024 * 1024 * 1024 * 2
	}
};

Key type: string
Value type: BinaryObject

One partitioned cache upto 1 million
Average cache entry size is between 2kbytes and 6kbytes +


[Problem]

Ignite server process has been dropping out of topology one by one over
time.

2018/05/09 17:53:30.564 [INFO
][disco-event-worker-#56][GridDiscoveryManager] Topology snapshot [ver=120,
servers=40, clients=80, CPUs=640, heap=560.0GB]
...
2018/05/10 08:20:44.254 [INFO
][disco-event-worker-#56][GridDiscoveryManager] Topology snapshot [ver=123,
servers=37, clients=80, CPUs=640, heap=530.0GB]
...
2018/05/10 11:29:43.461 [INFO
][disco-event-worker-#56][GridDiscoveryManager] Topology snapshot [ver=128,
servers=32, clients=80, CPUs=640, heap=480.0GB]
...
2018/05/10 19:30:08.519 [INFO
][disco-event-worker-#56][GridDiscoveryManager] Topology snapshot [ver=139,
servers=21, clients=80, CPUs=640, heap=370.0GB]


Now we lost 19 ignite servers out of 40 total from the topology.
It seems like the ignite dotnet server progress got frozen when an ignite
server was dropped.
ignite-jstack-node26.txt
<http://apache-ignite-users.70518.x6.nabble.com/file/t1784/ignite-jstack-node26.txt>  


[JVM Options]

SERVER

"-Duser.timezone=UTC",
"-DIGNITE_QUIET=false",
"-Djava.net.preferIPv4Stack=true",
"-Djava.awt.headless=true",
"-Xms10g",
"-Xmx10g",
"-XX:+AlwaysPreTouch",
"-XX:+UseG1GC",
"-XX:+ScavengeBeforeFullGC",
"-XX:+DisableExplicitGC"

CLIENT

"-Duser.timezone=UTC",
"-DIGNITE_QUIET=false",
"-Djava.net.preferIPv4Stack=true",
"-Djava.awt.headless=true",
"-Xms2g",
"-Xmx2g",
"-XX:+AlwaysPreTouch",
"-XX:+UseG1GC",
"-XX:+ScavengeBeforeFullGC",
"-XX:+DisableExplicitGC"


[Logs]

SERVER

2018/05/10 18:49:56.066 [INFO ][grid-timeout-worker-#39][IgniteKernal] 
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=8a2ce76e, uptime=23:20:53.173]
    ^-- H/N/C [hosts=40, nodes=104, CPUs=640]
    ^-- CPU [cur=2.13%, avg=1.23%, GC=0%]
    ^-- PageMemory [pages=37787]
    ^-- Heap [used=2571MB, free=74.89%, comm=10240MB]
    ^-- Non heap [used=77MB, free=-1%, comm=80MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=6, qSize=0]
    ^-- Outbound messages queue [size=13]
2018/05/10 18:49:56.343 [INFO ][grid-timeout-worker-#39][IgniteKernal]
FreeList [name=null, buckets=256, dataPages=35238, reusePages=12]

/ PROCESS FROZEN HERE AT 2018/05/10 18:50 !!!
 The dotnet ignite server process is still alive, but metrics stopped, no 
other logs, no CPU usage since. /

ignite-server-node26.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t1784/ignite-server-node26.zip>  


CLIENT

2018/05/10 18:50:29.383 [WARN ][Thread-2561][IgniteH2Indexing] Failed to
send message [node=TcpDiscoveryNode
[id=8a2ce76e-1bf2-4259-8592-81c11af9064f, addrs=[10.0.0.23, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500,
CDNode00000Q.hlbdeyzzwm2ujgdsre0nhzw3sg.dx.internal.cloudapp.net/10.0.0.23:47500],
discPort=47500, order=69, intOrder=69, lastExchangeTime=1525894136807,
loc=false, ver=2.3.0#19700101-sha1:00000000, isClient=false],
msg=GridQueryCancelRequest [qryReqId=5208317], errMsg=Failed to send message
(node left topology): TcpDiscoveryNode
[id=8a2ce76e-1bf2-4259-8592-81c11af9064f, addrs=[10.0.0.23, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500,
CDNode00000Q.hlbdeyzzwm2ujgdsre0nhzw3sg.dx.internal.cloudapp.net/10.0.0.23:47500],
discPort=47500, order=69, intOrder=69, lastExchangeTime=1525894136807,
loc=false, ver=2.3.0#19700101-sha1:00000000, isClient=false]]

ignite-client-node08.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t1784/ignite-client-node08.zip>  



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Node FAILED

Posted by akurbanov <an...@gmail.com>.

Hi,

Seems like a STW/full GC happens on your nodes for quite a long time or some
network issues are there, please try to set failureDetectionTimeout property
in IgniteConfiguration to some  and

<category name="org.apache.ignite.spi.communication.tcp">
    <level value="DEBUG"/>
</category>

in logger and see if this is reproduced.

Also having you see "Received incoming connection when already connected to
this node, rejecting" in logs, try adding usePairedConnections to true in
TcpCommunicationSpi.

Can you also attach log from the node that is failing to connect?

Best regards.
Anton



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Node FAILED

Posted by crenique <cr...@gmail.com>.

jstack dump attached.

ignite-jstack.txt
<http://apache-ignite-users.70518.x6.nabble.com/file/t1784/ignite-jstack.txt>  



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Node FAILED

Posted by crenique <cr...@gmail.com>.

It seems like ignite server process frozen.
The dotnet ignite server process is still alive, but metrics stopped, no
other logs, no CPU usage.

This is the last error message in ignite server logs.

* server log (in attached logfile)
2018/05/10 18:49:56.343 [INFO ][grid-timeout-worker-#39][IgniteKernal]
FreeList [name=null, buckets=256, dataPages=35238, reusePages=12]


* client log
 errMsg=Failed to send message (node left topology): TcpDiscoveryNode
[id=8a2ce76e-1bf2-4259-8592-81c11af9064f, addrs=[10.0.0.23, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500,
CDNode00000Q.hlbdeyzzwm2ujgdsre0nhzw3sg.dx.internal.cloudapp.net/10.0.0.23:47500],
discPort=47500, order=69, intOrder=69, lastExchangeTime=1525894137302,
loc=false, ver=2.3.0#19700101-sha1:00000000, isClient=false]]



I expected to see "Ignite Ver xxx stopped OK" message if something went
wrong and it kicked out of topology and stop0 is called.  But the ignite
server just stopped responding with no logs.


What could be a cause of this symptom ?





--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/