Posted to notifications@pekko.apache.org by "fredfp (via GitHub)" <gi...@apache.org> on 2023/08/17 13:05:19 UTC

[GitHub] [incubator-pekko] fredfp opened a new issue, #578: Clustering issues leading to all nodes being downed

fredfp opened a new issue, #578:
URL: https://github.com/apache/incubator-pekko/issues/578

I'm reopening here an issue that I reported at the time under the akka repo.

We had a case where an issue on a single node led to the whole akka-cluster being taken down.

### Here's a summary of what happened:
1. Healthy cluster made of 20ish nodes, running on k8s
2. Node A: encounters issues, triggers CoordinatedShutdown
3. Node A: experiences high CPU usage, maybe GC pause
4. Node A: sees B as unreachable and broadcasts it (B is almost certainly reachable, but A detects it as unreachable because of A's own high CPU usage, GC pause, or similar issues)
5. Cluster state: A Leaving, B seen unreachable by A, all the other nodes are Up
6. Leader currently cannot perform its duties (removing A) because of the reachability status (B seen as unreachable by A)
7. Node A: times out some coordinated shutdown phases. Hypothesis: timed out because leader could not remove A.
8. Node A: finishes coordinated shutdown nonetheless.
9. hypothesis - Node A: quarantined associations to other cluster nodes
10. Nodes B, C, D, E: SBR took decision DownSelfQuarantinedByRemote and is downing [...] including myself
11. hypothesis - Node B, C, D, E: quarantined associations to other cluster nodes
12. in a few steps, all remaining cluster nodes down themselves: SBR took decision DownSelfQuarantinedByRemote
13. the whole cluster is down

### Discussions, potential issues:

Considering the behaviour of CoordinatedShutdown (phases can time out and shutdown continues), shouldn't the leader ignore unreachabilities added by a Leaving node and be allowed to perform its duties?
At step 6 above, the Leader was blocked from removing A, but A still continued its shutdown process. The catastrophic ending could have been stopped here.
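
To make the question concrete, here is a minimal Python sketch of the rule being proposed. It is a toy model and not Pekko's implementation: `leader_can_act`, the `Status` enum, and the idea that the baseline discounts only `Down` observers are all assumptions made for illustration.

```python
from enum import Enum


class Status(Enum):
    UP = "Up"
    LEAVING = "Leaving"
    DOWN = "Down"


def leader_can_act(statuses, unreachable, ignored_observer_statuses):
    """Return True if no unreachability record blocks the leader.

    statuses: dict mapping node name -> Status
    unreachable: set of (observer, subject) pairs
    ignored_observer_statuses: observations from nodes in these
        statuses are not counted (hypothetical policy knob)
    """
    blocking = [
        (observer, subject)
        for observer, subject in unreachable
        if statuses[observer] not in ignored_observer_statuses
    ]
    return not blocking


# Cluster state from step 5: A is Leaving and sees B as unreachable.
statuses = {"A": Status.LEAVING, "B": Status.UP, "C": Status.UP}
unreachable = {("A", "B")}

# Assumed baseline: only observations from Down nodes are discounted,
# so A's record about B blocks the leader (step 6).
blocked_today = not leader_can_act(statuses, unreachable, {Status.DOWN})

# Proposed rule: also discount observations made by Leaving nodes.
unblocked_with_proposal = leader_can_act(
    statuses, unreachable, {Status.DOWN, Status.LEAVING}
)
```

In this model the leader is blocked under the assumed baseline and free to remove A once observations from Leaving nodes are discounted.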

DownSelfQuarantinedByRemote: @patriknw 's [comment](https://github.com/akka/akka/pull/29737#discussion_r515906571) seems spot on.
At step 9, nodes B, C, D, E should probably not take into account the `Quarantined` from a node that is Leaving.

DownSelfQuarantinedByRemote: Patrik's [comment](https://github.com/akka/akka/pull/29737#discussion_r515906571) also seems to apply to another case: `Quarantined` from nodes that are downing themselves because of DownSelfQuarantinedByRemote should probably not be taken into account either.
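
The hypothesised cascade, and the effect of ignoring `Quarantined` from Leaving or self-downing nodes, can be illustrated with a small Python simulation. This is only a sketch of the hypothesis in steps 9 to 13, not how Pekko's SBR is implemented; `simulate_cascade` and the `quarantine_targets` topology are made up for the example.

```python
from collections import deque


def simulate_cascade(quarantine_targets, first_sender, ignore_untrusted):
    """Toy model of the hypothesised quarantine cascade.

    quarantine_targets: dict mapping a node to the peers it quarantines
        while shutting down (hypothetical topology)
    first_sender: the Leaving node that starts the cascade
    ignore_untrusted: if True, receivers ignore Quarantined sent by
        Leaving or self-downing nodes (the proposed rule)

    Returns the nodes that downed themselves, in order.
    """
    downed = []
    untrusted = {first_sender}  # Leaving or self-downing nodes
    queue = deque([first_sender])
    while queue:
        sender = queue.popleft()
        if ignore_untrusted and sender in untrusted:
            continue  # proposed rule: drop Quarantined from this sender
        for peer in quarantine_targets.get(sender, []):
            if peer == first_sender or peer in downed:
                continue
            # Peer reacts with DownSelfQuarantinedByRemote, then
            # quarantines its own peers while shutting down.
            downed.append(peer)
            untrusted.add(peer)
            queue.append(peer)
    return downed


# Hypothetical topology: A quarantines B and C, which in turn
# quarantine D and E.
targets = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}

without_rule = simulate_cascade(targets, "A", ignore_untrusted=False)
with_rule = simulate_cascade(targets, "A", ignore_untrusted=True)
```

In this model, every other node ends up downed without the rule, while with the rule no node downs itself, because the only senders of `Quarantined` are nodes that are already on their way out.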

At steps 10 and 12, any cluster singletons running on affected nodes wouldn't be gracefully shut down using the configured termination message. This is probably the right thing to do, but I'm adding this note here nonetheless.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@pekko.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@pekko.apache.org
For additional commands, e-mail: notifications-help@pekko.apache.org

[GitHub] [incubator-pekko] fredfp commented on issue #578: Clustering issues leading to all nodes being downed

Posted by "fredfp (via GitHub)" <gi...@apache.org>.

fredfp commented on issue #578:
URL: https://github.com/apache/incubator-pekko/issues/578#issuecomment-1682259440

   I have extra logs that may be useful:
   
   > Remote ActorSystem must be restarted to recover from this situation. Reason: Cluster member removed, previous status [Down]

