Posted to user@ignite.apache.org by scottmf <sc...@gmail.com> on 2019/04/09 21:00:13 UTC

waiting for partition map exchange questions

hi All,
I just encountered a situation in my k8s cluster where I'm running a 3-node
Ignite setup with 2 client nodes.  The server nodes each have 8GB of
off-heap, an 8GB JVM heap (with G1GC), and 4GB of OS memory, without
persistence.  I'm using Ignite 2.7.
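For reference, that memory setup roughly corresponds to a data region
configuration like this (a sketch from memory, not my exact config; names
are illustrative):

```java
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ServerNodeConfig {
    public static IgniteConfiguration create() {
        DataRegionConfiguration region = new DataRegionConfiguration()
                .setName("default")
                .setMaxSize(8L * 1024 * 1024 * 1024)  // 8GB off-heap per node
                .setPersistenceEnabled(false);        // no persistence, as above

        return new IgniteConfiguration()
                .setDataStorageConfiguration(
                        new DataStorageConfiguration()
                                .setDefaultDataRegionConfiguration(region));
    }
}
```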

One of the Ignite nodes was killed due to some issue in the cluster.  I
believe this was the sequence of events:

-> Data Eviction spikes on two nodes in the cluster (NODE A & B), then 15
mins later..
-> NODE C goes down
-> NODE D comes up (to replace node C)
--> NODE D attempts a PME
--> NODE B log = "Local node has detected failed nodes and started
cluster-wide procedure"
--> During PME the Ignite JVM on NODE D is restarted since it was taking too
long and was killed by a k8s liveness probe.
--> NODE D comes back up and attempts another PME
---> Note: I see these messages from all the nodes: "First 10 pending
exchange futures [total=2]".  The total keeps increasing; the highest number
I see is total=14.
---> NODE D log = "Failed to wait for initial partition map exchange.
Possible reasons are:..."
---> NODE B log = "Possible starvation in striped pool. queue=[], dealock =
false, Completed: 991189487 ..."
---> NODE A log = "Client node considered as unreachable and will be dropped
from cluster, because no metrics update messages received in interval:
TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
network problems or long GC pause on client node, try to increase this
parameter. [nodeId=c5a92006-c29a-4a37-b149-7ec7855dc401,
clientFailureDetectionTimeout=30000]"
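For anyone hitting that last message: the 30000 ms in it is the default
client failure detection timeout, which the log suggests increasing.  A
sketch of how that would look on the node configuration (the 60000 value is
just an illustrative, more relaxed guess):

```java
import org.apache.ignite.configuration.IgniteConfiguration;

public class DiscoveryTimeoutConfig {
    public static IgniteConfiguration create() {
        return new IgniteConfiguration()
                // Default is 30000 ms (the value reported in the log above);
                // raise it if clients see long GC pauses or network hiccups.
                .setClientFailureDetectionTimeout(60_000L);
    }
}
```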

NOTE that NODE D kept restarting due to a k8s liveness probe.  I think I'm
going to remove the probe or make it much more relaxed.

During this time the Ignite cluster is completely frozen.  Restarting NODE D
and replacing it with NODE E did not solve the issue.  The only way I could
solve the problem is to restart NODE B.  Any idea why this could have
occurred or what I can do to prevent it in the future?

I do see this from the failureHandler: "FailureContext [type=CRITICAL_ERROR,
err=class org.apache.ignite.IgniteException: Failed to create string
representation of binary object.]" but not sure if this is something that
would have caused the cluster to seize up.
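If it helps, the handler that emits that FailureContext line can be set
explicitly on the node configuration; a minimal sketch (not what I'm
running, just for reference):

```java
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class FailureHandlerConfig {
    public static IgniteConfiguration create() {
        return new IgniteConfiguration()
                // Stops (or halts) the node on critical failures instead of
                // leaving it running in a half-broken state.
                .setFailureHandler(new StopNodeOrHaltFailureHandler());
    }
}
```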

Overall, nodes go down in this environment and come back all the time
without issues, but I've seen this problem occur twice in the last few
months.

I have logs & thread dumps for all the nodes in the system so if you want me
to check anything in particular let me know.

thanks,
Scott



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: waiting for partition map exchange questions

Posted by Scott Feldstein <sc...@gmail.com>.
Hi Andrei, have you had a chance to check out these logs?

Thanks

On Sun, Apr 14, 2019 at 17:18 scottmf <sc...@gmail.com> wrote:

> ignite.tgz
> <http://apache-ignite-users.70518.x6.nabble.com/file/t1632/ignite.tgz>
> (attached file)
>
>
>
>

Re: waiting for partition map exchange questions

Posted by scottmf <sc...@gmail.com>.
ignite.tgz
<http://apache-ignite-users.70518.x6.nabble.com/file/t1632/ignite.tgz>  
(attached file)




Re: waiting for partition map exchange questions

Posted by scottmf <sc...@gmail.com>.
thanks Andrei,

I've attached the files.  The outage occurred at approximately
2019-04-08T19:43Z

the host ending in 958sw is the host that went down at the start of the
outage.  Host ending in dldh2 came up after 958sw went down.

hwgpf and zq8j8 were up the entire time.

These are the server nodes.  Let me know if you need the client node logs or
want any metric data.

Scott




Re: waiting for partition map exchange questions

Posted by aealexsandrov <ae...@gmail.com>.
Hi,

Yes, without logs it's not easy to understand the reason. Could you please
attach them?

BR,
Andrei


