Posted to issues@activemq.apache.org by "Stephan Austermühle (Jira)" <ji...@apache.org> on 2021/12/13 12:49:00 UTC

[jira] [Created] (ARTEMIS-3606) Broker does not discard absent replica instances

Stephan Austermühle created ARTEMIS-3606:
--------------------------------------------

             Summary: Broker does not discard absent replica instances
                 Key: ARTEMIS-3606
                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3606
             Project: ActiveMQ Artemis
          Issue Type: Bug
          Components: Broker
    Affects Versions: 2.19.0
            Reporter: Stephan Austermühle


We have deployed ActiveMQ Artemis v2.19.0 in an HA+cluster configuration, hosted on Kubernetes (on-premises, non-cloud), and use the JGroups {{KUBE_PING}} protocol for broker discovery. During regular operation we have 2 primary and 2 replica brokers, and everything looks fine.
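For reference, a JGroups stack using {{KUBE_PING}} looks roughly like the sketch below. This is illustrative only, not our exact file; the namespace and label values are placeholders:
{code:xml}
<!-- jgroups.xml: illustrative sketch, namespace/labels are placeholders -->
<config xmlns="urn:org:jgroups">
    <TCP bind_port="7800"/>
    <!-- KUBE_PING queries the Kubernetes API to find peer broker Pods -->
    <org.jgroups.protocols.kubernetes.KUBE_PING
        namespace="messaging"
        labels="app=artemis"/>
    <MERGE3/>
    <FD_ALL/>
    <VERIFY_SUSPECT/>
    <pbcast.NAKACK2/>
    <pbcast.GMS/>
</config>
{code}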

For testing, we now shut down the replica instances by scaling the corresponding StatefulSet to zero, i.e., a graceful shutdown with no hard kill, leaving no replica Pods. We end up with a weird cluster state: 2 primaries and 1 zombie replica still connected to primary 1.

Restarting the replicas brings the cluster back to a normal state, but only sometimes.

[According to the docs|https://activemq.apache.org/components/artemis/documentation/latest/clusters.html#discovery-groups], the missing broker instances should be removed:
{quote}If it has not received a broadcast from a particular server for a length of time it will remove that server's entry from its list.
{quote}
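The stale-entry timeout the docs refer to corresponds to the discovery group's {{refresh-timeout}}. A minimal broker.xml sketch of the relevant sections (names and values are illustrative placeholders, not our production settings):
{code:xml}
<!-- broker.xml excerpt: illustrative sketch, names/values are placeholders -->
<broadcast-groups>
   <broadcast-group name="bg-group1">
      <jgroups-file>jgroups.xml</jgroups-file>
      <jgroups-channel>activemq_broadcast_channel</jgroups-channel>
      <broadcast-period>2000</broadcast-period>
      <connector-ref>netty-connector</connector-ref>
   </broadcast-group>
</broadcast-groups>

<discovery-groups>
   <discovery-group name="dg-group1">
      <jgroups-file>jgroups.xml</jgroups-file>
      <jgroups-channel>activemq_broadcast_channel</jgroups-channel>
      <!-- per the docs, entries not re-announced within this window (ms)
           should be removed from the server list -->
      <refresh-timeout>10000</refresh-timeout>
   </discovery-group>
</discovery-groups>
{code}
In our case, the absent replicas are not removed from the topology even though no broadcasts can be coming from them.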
The broker also includes the zombie replicas in its topology updates to JMS clients, which in our case results in more than 30 connection attempts per second. Since Kubernetes no longer knows the shut-down replica instances, the client's name resolution fails with {{Cannot resolve host}}. The client then eats a whole CPU core on connection attempts and logging the failures.

By the way, the JMS client should pause for some time after a {{Cannot resolve host}} exception instead of retrying immediately. The pause parameters {{retryInterval}}, {{retryIntervalMultiplier}}, {{maxRetryInterval}}, and {{reconnectAttempts}} appear to have no effect in this case.
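For context, the behaviour we expected from those parameters is an exponentially growing pause between attempts, capped at the maximum. A minimal sketch of that documented semantics (Python, function and variable names are ours, chosen to mirror the client settings):

```python
def backoff_intervals(retry_interval, multiplier, max_interval, attempts):
    """Expected pause (ms) before each reconnect attempt, per the
    documented semantics of retryInterval, retryIntervalMultiplier,
    and maxRetryInterval. Illustrative only."""
    intervals = []
    delay = retry_interval
    for _ in range(attempts):
        intervals.append(delay)
        # grow the pause, but never beyond maxRetryInterval
        delay = min(delay * multiplier, max_interval)
    return intervals

# e.g. retryInterval=1000, retryIntervalMultiplier=2.0, maxRetryInterval=8000
print(backoff_intervals(1000, 2.0, 8000, 5))
```

With the observed behaviour, the client instead retries back-to-back with no pause at all.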

[~brusdev] commented:
{quote}logs confirm no messages from replicas so the issue isn't caused by jgroups, it could be due to a bug on propagating cluster topology updates. The cluster topology updates are sent using ClusterTopologyChangeMessage.
{quote}
Please see [https://stackoverflow.com/q/70288344/6529100] for additional information, logs, and configuration.

Maybe it is worth mentioning that shutting down a primary (master) instance works as expected.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)