Posted to users@activemq.apache.org by Thai Le <ln...@gmail.com> on 2022/02/09 18:00:25 UTC

Artemis high availability in Kubernetes

Hello,

We have been running Artemis 2.17 with the replication HA policy (1 master and
1 slave) in Kubernetes for a few months. I was advised to run Artemis
without HA in Kubernetes since the pod will be restarted anyway, but my setup
was a team decision so I did not make any change. Recently we had a few
incidents in which the master went down, or both the master and slave went
down at the same time. In all cases, the sender and the consumer threw an
exception that the connection to the broker was interrupted. This means the
connection does not seamlessly transfer to the slave but requires the client
to reconnect, the same as if I had no replica configured and relied only on a
Kubernetes pod restart. The only disadvantage of not having a replica is the
longer downtime, since Kubernetes may take a minute to restart the pod. So,
with that, does the recommendation to run Artemis in Kubernetes without HA
still hold?
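
For reference, the brokers use the standard replication ha-policy in
broker.xml, roughly like the trimmed sketch below (only the HA section is
shown; the optional check-for-live-server / allow-failback settings are
illustrative rather than our exact values):

  Master broker.xml:

    <ha-policy>
      <replication>
        <master>
          <!-- optional: check whether a live server already exists before activating -->
          <check-for-live-server>true</check-for-live-server>
        </master>
      </replication>
    </ha-policy>

  Slave broker.xml:

    <ha-policy>
      <replication>
        <slave>
          <!-- optional: give control back to the original master when it returns -->
          <allow-failback>true</allow-failback>
        </slave>
      </replication>
    </ha-policy>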

Regards

Thai Le

RE: Artemis high availability in Kubernetes

Posted by Vilius Šumskas <vi...@rivile.lt>.
Hi,

+1 from me on getting recommendations from the core developers regarding Artemis on Kubernetes.

I can only share how we are currently doing our HA.

I've invested weeks of my time investigating a master/slave Artemis shared-disk cluster with the following configurations:
1) Artemis HA on VM machines (GCE or similar).
2) Artemis HA on Kubernetes (GKE or similar).
3) Artemis HA with Operator under Kubernetes.
4) Artemis without an HA config, leaving HA to Kubernetes and a cloud load balancer.

We went with option 1) because, as far as I could gather:
* Option 2 requires shared storage within Kubernetes, which is not available out of the box from any cloud provider. You must build some kind of Kubernetes setup with Ceph or GlusterFS support yourself.
* Option 3 is in early alpha and its stability could be questionable.
* Option 4 is not really an HA solution, because if the node attached to the persistent storage fails you have to manually redeploy a different PVC configuration on a different node, or you again need a shared-disk solution, which is not available from any cloud provider.

Our option 1 is configured on GCE with a shared NetApp NFS 4.1 volume. So far we are happy with the stability. NetApp pricing could be better though 😐
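
For anyone curious, the shared-disk pair boils down to the shared-store ha-policy with both brokers' data directories pointing at the same NFS mount. A rough sketch (the paths and the failover-on-shutdown flag are illustrative, not our exact configuration):

  Live broker.xml:

    <!-- both brokers point their journals at the shared NFS mount -->
    <paging-directory>/mnt/artemis-nfs/paging</paging-directory>
    <bindings-directory>/mnt/artemis-nfs/bindings</bindings-directory>
    <journal-directory>/mnt/artemis-nfs/journal</journal-directory>
    <large-messages-directory>/mnt/artemis-nfs/large-messages</large-messages-directory>

    <ha-policy>
      <shared-store>
        <master>
          <failover-on-shutdown>true</failover-on-shutdown>
        </master>
      </shared-store>
    </ha-policy>

  Backup broker.xml (same directories, backup role):

    <ha-policy>
      <shared-store>
        <slave/>
      </shared-store>
    </ha-policy>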

It would be interesting to know what others are doing, especially on Google Cloud.

-- 
    Vilius

-----Original Message-----
From: Jo De Troy <jo...@gmail.com> 
Sent: Thursday, February 10, 2022 12:55 PM
To: users@activemq.apache.org
Subject: Re: Artemis high availability in Kubernetes

Hello,

I'm also interested in the recommended setup on Kubernetes for an HA ActiveMQ Artemis broker.
What is possible and what is not? Is an active/passive setup possible/supported on Kubernetes, or does that not make sense?
What is recommended?

Best Regards,
Jo

On Wed, 9 Feb 2022 at 19:00, Thai Le <ln...@gmail.com> wrote:

> Hello,
>
> We have been running Artemis 2.17 with the replication HA policy (1 master
> and 1 slave) in Kubernetes for a few months. I was advised to run Artemis
> without HA in Kubernetes since the pod will be restarted anyway, but my
> setup was a team decision so I did not make any change. Recently we had a
> few incidents in which the master went down, or both the master and slave
> went down at the same time. In all cases, the sender and the consumer threw
> an exception that the connection to the broker was interrupted. This means
> the connection does not seamlessly transfer to the slave but requires the
> client to reconnect, the same as if I had no replica configured and relied
> only on a Kubernetes pod restart. The only disadvantage of not having a
> replica is the longer downtime, since Kubernetes may take a minute to
> restart the pod. So, with that, does the recommendation to run Artemis in
> Kubernetes without HA still hold?
>
> Regards
>
> Thai Le
>

Re: Artemis high availability in Kubernetes

Posted by Jo De Troy <jo...@gmail.com>.
Hello,

I'm also interested in the recommended setup on Kubernetes for an HA
ActiveMQ Artemis broker.
What is possible and what is not? Is an active/passive setup
possible/supported on Kubernetes, or does that not make sense?
What is recommended?

Best Regards,
Jo

On Wed, 9 Feb 2022 at 19:00, Thai Le <ln...@gmail.com> wrote:

> Hello,
>
> We have been running Artemis 2.17 with the replication HA policy (1 master
> and 1 slave) in Kubernetes for a few months. I was advised to run Artemis
> without HA in Kubernetes since the pod will be restarted anyway, but my
> setup was a team decision so I did not make any change. Recently we had a
> few incidents in which the master went down, or both the master and slave
> went down at the same time. In all cases, the sender and the consumer threw
> an exception that the connection to the broker was interrupted. This means
> the connection does not seamlessly transfer to the slave but requires the
> client to reconnect, the same as if I had no replica configured and relied
> only on a Kubernetes pod restart. The only disadvantage of not having a
> replica is the longer downtime, since Kubernetes may take a minute to
> restart the pod. So, with that, does the recommendation to run Artemis in
> Kubernetes without HA still hold?
>
> Regards
>
> Thai Le
>

Re: Artemis high availability in Kubernetes

Posted by Gary Tully <ga...@gmail.com>.
if you are using multiple brokers, it is best to distribute them across
pods, such that a single pod failure does not result in a complete
outage.

On Fri, 11 Feb 2022 at 15:10, Thai Le <ln...@gmail.com> wrote:
>
> Hi guys
> Thank you very much for sharing.
> @Vilius I have tried the artemis-operator and the setup is much simpler
> than my current one, although it is best used for scalability and there is
> no slave deployed, so its HA is considered the same as option 4. The only
> thing preventing us from using it is the missing configuration for
> bootstrap.xml.
>
> @Jo In our setup, the producer uses the Spring JmsTemplate to send
> messages, so it has to catch the JmsException, which is a runtime
> exception, and retry a few times.
>
> @Gary, in your single pod section, you mentioned using multiple brokers;
> does that mean running a cluster of multiple brokers in a single pod, or
> each broker in its own pod?
>
> Regards
>
> Thai Le

Re: Artemis high availability in Kubernetes

Posted by Thai Le <ln...@gmail.com>.
Hi guys
Thank you very much for sharing.
@Vilius I have tried the artemis-operator and the setup is much simpler
than my current one, although it is best used for scalability and there is
no slave deployed, so its HA is considered the same as option 4. The only
thing preventing us from using it is the missing configuration for
bootstrap.xml.

@Jo In our setup, the producer uses the Spring JmsTemplate to send messages,
so it has to catch the JmsException, which is a runtime exception, and retry
a few times.
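
Roughly, the retry wrapper looks like the sketch below (the queue name,
attempt count and backoff are illustrative placeholders, not our real values):

  // Sketch only: retrying a JmsTemplate send when the broker connection drops.
  import org.springframework.jms.JmsException;
  import org.springframework.jms.core.JmsTemplate;

  public class RetryingSender {
      private final JmsTemplate jmsTemplate;

      public RetryingSender(JmsTemplate jmsTemplate) {
          this.jmsTemplate = jmsTemplate;
      }

      public void sendWithRetry(String queue, String payload) {
          int attempts = 0;
          while (true) {
              try {
                  jmsTemplate.convertAndSend(queue, payload);
                  return;
              } catch (JmsException e) { // unchecked; thrown when the connection is interrupted
                  if (++attempts >= 3) {
                      throw e; // give up after a few attempts
                  }
                  try {
                      Thread.sleep(1000L * attempts); // simple backoff before retrying
                  } catch (InterruptedException ie) {
                      Thread.currentThread().interrupt();
                      throw e;
                  }
              }
          }
      }
  }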

@Gary, in your single pod section, you mentioned using multiple brokers;
does that mean running a cluster of multiple brokers in a single pod, or each
broker in its own pod?

Regards

Thai Le

Re: Artemis high availability in Kubernetes

Posted by Jo De Troy <jo...@gmail.com>.
Gary,

So an HA solution for Artemis running on Kubernetes is not worth it, as we
expect Kubernetes to recover anyway?
If a producer loses its connection to the Artemis instance, would you not lose
the data? Or would a typical client try to resubmit it, or would the
client/application need to be designed for that?
A multi-broker setup on Kubernetes would typically be for scalability, not
for HA? Or when would you consider a cluster of brokers?
Forgive me for my stupid questions, I'm pretty new to the message broker
world.
If the broker is using persistent storage, I understand we don't expect to
lose the data once it's on the broker (queue/topic).

Best Regards,
Jo

On Fri, 11 Feb 2022 at 12:10, Gary Tully <ga...@gmail.com> wrote:

> Hello,
> the reconnect issue? How are your clients configured? Do they get
> topology from the pair of brokers on kube?
>
> --
> On re-connection:
>
> failover with the Artemis JMS client will only occur between pairs.
> It is restricted in that way to protect users of temp queues and
> durable subs, b/c those resources will only be available on a replica.
> Other clients (OpenWire or AMQP Qpid JMS) are not as protective by
> default, so they will continue to try to reconnect.
>
> In any event, from an application perspective, it is often best to
> hide the JMS connection with something like camel-jms or spring jms
> template. In that way, the error handling can be separately controlled
> and isolated from protocol specifics. It is an extra level of
> indirection with sensible defaults that can be tweaked as needed.
> If the broker url is behind a proxy/load balancer/firewall or dns or
> some other mechanism that is broker topology agnostic, it can help.
>
>
> --
> On the replication vs single broker pod:
>
> these are very different, with replication there are two copies of
> your data, with a single pod there is only one copy.
>
> - Single Pod:
> Because kube does a good (if slow) job of auto restarting it makes
> sense to leverage it to keep your single journal available. It is very
> intuitive and simple.
> If order is not important, cluster with a second broker and allow
> clients to use either.
> If order is important, consider using multiple brokers and
> partitioning data across them[1]. In that way, you can always be
> partially available.
>
> - Replica Pods
> If you need two copies of your data, then you need replication, and
> this is more involved.
> With replication, the copy is synchronous. There are two overheads
> to consider.
> First, at runtime the broker responds when the *replica* gets the
> message, which is usually trivial b/c of a fast network; but it is
> important to be aware of.
> Second is the overhead of coordination on activation. Because there
> are two copies of the journal, only one can be active at a time.
>  - as part of [2] we introduced coordinated activation via zk. I think
> we probably need a kube version of this that layers over a crd or some
> other etcd primitive. An operator could serve the role of an oracle
> here also. I note that the operator sdk provides a leader election
> primitive [3] that may be perfect. The only issue may be permissions.
> There will still be some necessary delay, lease expiration time etc.,
> but it should be possible to make this time limited and bounded. There
> is a bit of work to do here.
> In short, having a replica pod should reduce time to recover but at a cost.
>
> After putting these thoughts down, I think the short answer is yes, in
> kubernetes a single pod is currently best. It was a good question :-)
>
> feedback welcome!
> I hope this helps,
> gary.
>
> [1]
> https://activemq.apache.org/components/artemis/documentation/latest/broker-balancers.html#data-gravity
> [2]
> https://activemq.apache.org/components/artemis/documentation/latest/ha.html#Pluggable-Quorum-Vote-Replication-configurations
> [3] https://github.com/operator-framework/operator-sdk/issues/784
>

Re: Artemis high availability in Kubernetes

Posted by Gary Tully <ga...@gmail.com>.
Hello,
the reconnect issue? How are your clients configured? Do they get
topology from the pair of brokers on kube?

--
On re-connection:

failover with the Artemis JMS client will only occur between pairs.
It is restricted in that way to protect users of temp queues and
durable subs, b/c those resources will only be available on a replica.
Other clients (OpenWire or AMQP Qpid JMS) are not as protective by
default, so they will continue to try to reconnect.

In any event, from an application perspective, it is often best to
hide the JMS connection with something like camel-jms or spring jms
template. In that way, the error handling can be separately controlled
and isolated from protocol specifics. It is an extra level of
indirection with sensible defaults that can be tweaked as needed.
If the broker url is behind a proxy/load balancer/firewall or dns or
some other mechanism that is broker topology agnostic, it can help.
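
To make that concrete, a core JMS client that knows both brokers of the pair
would use a URL along these lines (host names and retry values here are
illustrative; see the client failover docs for the full option list):

  // Sketch only: an Artemis JMS client that knows both brokers of the HA pair
  // and keeps retrying reconnects. Host names and retry values are made up.
  import javax.jms.Connection;
  import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

  public class FailoverClient {
      public static void main(String[] args) throws Exception {
          ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory(
              "(tcp://artemis-master:61616,tcp://artemis-slave:61616)"
                  + "?ha=true&reconnectAttempts=30&retryInterval=1000");
          try (Connection connection = factory.createConnection()) {
              connection.start();
              // ... create sessions, producers and consumers as usual
          }
      }
  }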


--
On the replication vs single broker pod:

these are very different, with replication there are two copies of
your data, with a single pod there is only one copy.

- Single Pod:
Because kube does a good (if slow) job of auto restarting it makes
sense to leverage it to keep your single journal available. It is very
intuitive and simple.
If order is not important, cluster with a second broker and allow
clients to use either (a rough config sketch follows below).
If order is important, consider using multiple brokers and
partitioning data across them[1]. In that way, you can always be
partially available.
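
As a rough illustration of the "cluster with a second broker" case, each
broker.xml would carry a cluster-connection something like this (connector
names and hosts are placeholders):

  <connectors>
    <connector name="this-broker">tcp://broker-0:61616</connector>
    <connector name="other-broker">tcp://broker-1:61616</connector>
  </connectors>

  <cluster-connections>
    <cluster-connection name="my-cluster">
      <!-- the connector that identifies this broker to the cluster -->
      <connector-ref>this-broker</connector-ref>
      <message-load-balancing>ON_DEMAND</message-load-balancing>
      <static-connectors>
        <connector-ref>other-broker</connector-ref>
      </static-connectors>
    </cluster-connection>
  </cluster-connections>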

- Replica Pods
If you need two copies of your data, then you need replication, and
this is more involved.
With replication, the copy is synchronous. There are two overheads
to consider.
First, at runtime the broker responds when the *replica* gets the
message, which is usually trivial b/c of a fast network; but it is
important to be aware of.
Second is the overhead of coordination on activation. Because there
are two copies of the journal, only one can be active at a time.
 - as part of [2] we introduced coordinated activation via zk. I think
we probably need a kube version of this that layers over a crd or some
other etcd primitive. An operator could serve the role of an oracle
here also. I note that the operator sdk provides a leader election
primitive [3] that may be perfect. The only issue may be permissions.
There will still be some necessary delay, lease expiration time etc.,
but it should be possible to make this time limited and bounded. There
is a bit of work to do here (a rough broker.xml sketch of [2] follows
below).
In short, having a replica pod should reduce time to recover but at a cost.
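
For reference, the coordinated activation from [2] is configured roughly like
this on the live broker (the ZooKeeper connect-string is a placeholder; the
backup uses <backup> in place of <primary> with the same manager block):

  <ha-policy>
    <replication>
      <primary>
        <manager>
          <!-- coordination manager pointing at the ZooKeeper ensemble, per [2] -->
          <properties>
            <property key="connect-string" value="zk-0:2181,zk-1:2181,zk-2:2181"/>
          </properties>
        </manager>
      </primary>
    </replication>
  </ha-policy>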

After putting these thoughts down, I think the short answer is yes, in
kubernetes a single pod is currently best. It was a good question :-)

feedback welcome!
I hope this helps,
gary.

[1] https://activemq.apache.org/components/artemis/documentation/latest/broker-balancers.html#data-gravity
[2] https://activemq.apache.org/components/artemis/documentation/latest/ha.html#Pluggable-Quorum-Vote-Replication-configurations
[3] https://github.com/operator-framework/operator-sdk/issues/784