Posted to issues@geode.apache.org by "Alexander Murmann (Jira)" <ji...@apache.org> on 2021/04/05 21:58:00 UTC

[jira] [Updated] (GEODE-8809) Member Stops Sending Heartbeats, CPU Saturation Cannot Be Proven

     [ https://issues.apache.org/jira/browse/GEODE-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Murmann updated GEODE-8809:
-------------------------------------
    Labels:   (was: blocks-1.14.0)

> Member Stops Sending Heartbeats, CPU Saturation Cannot Be Proven
> ----------------------------------------------------------------
>
>                 Key: GEODE-8809
>                 URL: https://issues.apache.org/jira/browse/GEODE-8809
>             Project: Geode
>          Issue Type: Bug
>          Components: messaging
>            Reporter: Nabarun Nag
>            Assignee: Bill Burcham
>            Priority: Major
>
> We see this characteristic failure in a number of proprietary applications:
>  * The member stops sending heartbeats.
>  * The coordinator requests an availability check from the member.
>  * The member receives the request after a delay.
>  * The delay causes the server to be kicked out (it receives a ForcedDisconnectException).
>  * Operations fail.
>  * The server reconnects.
> Usually, when the failure detector/health monitor kicks a member out of the distributed system, it is for one of these reasons:
> 1. The member really was malfunctioning or unreachable (i.e. something outside of health monitoring had a problem)
>   a. Network problems
>     i. Partition: 2-way, N-way
>     ii. Slowdown or error rate increase
>   b. The CPU was over-taxed in the faulty member. We see gaps on the order of 10 seconds or more in heartbeat generation on that member.
>     i. Geode was running in a virtualized environment and the virtualization system didn’t give the Geode process sufficient CPU
>     ii. JVM memory was over-utilized, so garbage collection pauses took too long
>     iii. There was simply too much CPU demand and the product failed to reserve enough CPU capacity to keep the heartbeat going
> This ticket captures situations where the failure detector causes a member to be kicked out *but we cannot prove definitively that any of these is the root cause*.
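
The description mentions gaps on the order of 10 seconds or more in heartbeat generation. As a rough illustration of how such gaps might be located when investigating an incident, here is a minimal sketch that scans a series of heartbeat send times and reports consecutive pairs separated by at least a threshold. `HeartbeatGapFinder`, `Gap`, and `findGaps` are hypothetical names, not part of Geode, and the sketch assumes the send times have already been extracted (for example from logs or statistics) as epoch milliseconds in ascending order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Geode: finds long gaps between heartbeat send times.
final class HeartbeatGapFinder {

  /** A gap between two consecutive heartbeats, both in epoch milliseconds. */
  static final class Gap {
    final long startMillis;
    final long endMillis;

    Gap(long startMillis, long endMillis) {
      this.startMillis = startMillis;
      this.endMillis = endMillis;
    }

    long durationMillis() {
      return endMillis - startMillis;
    }
  }

  /**
   * Returns every pair of consecutive send times that are at least
   * thresholdMillis apart. Assumes sendTimesMillis is sorted ascending.
   */
  static List<Gap> findGaps(long[] sendTimesMillis, long thresholdMillis) {
    List<Gap> gaps = new ArrayList<>();
    for (int i = 1; i < sendTimesMillis.length; i++) {
      long delta = sendTimesMillis[i] - sendTimesMillis[i - 1];
      if (delta >= thresholdMillis) {
        gaps.add(new Gap(sendTimesMillis[i - 1], sendTimesMillis[i]));
      }
    }
    return gaps;
  }

  public static void main(String[] args) {
    // Made-up sample data: one long pause between 5,000 ms and 17,000 ms.
    long[] sendTimes = {0L, 2_500L, 5_000L, 17_000L, 19_500L};
    for (Gap gap : findGaps(sendTimes, 10_000L)) {
      System.out.println(
          "gap of " + gap.durationMillis() + " ms starting at " + gap.startMillis);
    }
  }
}
```

A gap found this way only shows that heartbeats were not generated for a while. It does not say why, so it would still need to be correlated with GC logs, CPU statistics, or hypervisor data to distinguish the causes listed above.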


