Posted to issues@geode.apache.org by "Alexander Murmann (Jira)" <ji...@apache.org> on 2021/04/05 21:58:00 UTC
[jira] [Updated] (GEODE-8809) Member Stops Sending Heartbeats, CPU Saturation Cannot Be Proven
[ https://issues.apache.org/jira/browse/GEODE-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexander Murmann updated GEODE-8809:
-------------------------------------
Labels: (was: blocks-1.14.0)
> Member Stops Sending Heartbeats, CPU Saturation Cannot Be Proven
> ----------------------------------------------------------------
>
> Key: GEODE-8809
> URL: https://issues.apache.org/jira/browse/GEODE-8809
> Project: Geode
> Issue Type: Bug
> Components: messaging
> Reporter: Nabarun Nag
> Assignee: Bill Burcham
> Priority: Major
>
> We see this characteristic failure in a number of proprietary applications:
> * A member stops sending heartbeats.
> * The coordinator requests an availability test from that member.
> * The member receives the request after a delay.
> * The delay causes the server to be kicked out (it receives a ForcedDisconnectException).
> * Operations fail.
> * The server reconnects.
> Usually, when the failure detector/health monitor kicks a member out of the distributed system, it is for one of these reasons:
> 1. The member really was malfunctioning or unreachable (i.e. something outside of health monitoring had a problem)
>    a. Network problems
>       i. Partition: 2-way, N-way
>       ii. Slowdown or error rate increase
>    b. CPU was over-taxed in the faulty member. We see gaps on the order of 10s or more in heartbeat generation on that member.
>       i. Geode was running in a virtualized environment and the virtualization system didn’t give the Geode process sufficient CPU
>       ii. JVM memory was over-utilized, so garbage collection (pauses) took too long
>       iii. There was simply too much CPU demand and the product failed to reserve enough CPU capacity to keep the heartbeat going
> This ticket captures situations where the failure detector causes a member to be kicked out *but we cannot prove definitively that any of these is the root cause*.