Posted to users@activemq.apache.org by Ilkka Virolainen <Il...@bitwise.fi> on 2018/05/02 07:01:50 UTC

Artemis 2.5.0 - Colocated scaledown cluster issues

Hello,

In addition to a previously reported issue [1], I have some problems with my Artemis cluster. My setup [2] is a symmetric two-node cluster of colocated instances with scaledown. Besides the node restart leaving replication in a problematic state [1], there are other issues, namely:

1) After running for approximately two weeks, one of the nodes crashed due to heap space exhaustion. Heap dump analysis indicates that the cluster connection failed and millions of messages ended up in the internal store-and-forward queue, causing an eventual OOM exception - I guess the internal messages are not paged?

2) I have now run the cluster for ~2 weeks and it has ended up in a state where messages are redistributed from node 1 to node 2 BUT not the other way around. This may be the same issue as 1), but I cannot tell for sure. I tried setting the core server logging level to DEBUG on node 2 and sending messages to a test topic, but I get no references to the address name in the Artemis logs.

I realize that it's difficult to address these problems given the information at hand and the circumstances in which they occur: they (excluding the issue described in [1]) start to appear only after the cluster has been running for a long time, with no apparent cause and no easy way to reproduce them. I would, however, appreciate any tips for debugging this further or advice on where to look for a probable cause.

- Ilkka

[1] Backup voting issue: http://activemq.2283324.n4.nabble.com/Artemis-2-5-0-Problems-with-colocated-scaledown-td4737583.html#a4737808
[2] Sample brokers: https://github.com/ilkkavi/activemq-artemis/tree/scaledown-issue/issues/IssueExample/src/main/resources/activemq

RE: Artemis 2.5.0 - Colocated scaledown cluster issues

Posted by Ilkka Virolainen <Il...@bitwise.fi>.
Thank you for your response. Regarding the cluster reconnection, I used the setting shown in the scaledown examples:

    <!-- since the backup servers scale down we need a sensible setting here so the bridge will stop -->
    <reconnect-attempts>5</reconnect-attempts>

I'm not sure how much effect it has on this situation though (at least in case 2), since nothing was logged about the bridge being stopped and JMX reported the cluster topology as intact. The only anomaly was that one node's internal store-and-forward queue - the one where messages were piling up - had consumers: 0, while the other node's internal queue had the one consumer it was supposed to have.
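
For reference, one way to watch for this state is to poll the consumer and message counts of the store-and-forward queues over JMX. A minimal sketch, assuming remote JMX is enabled on the broker and the Artemis 2.x MBean naming; the service URL and the internal.sf name filter are assumptions to adjust:

    import java.util.Set;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class SnfQueueCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder endpoint: adjust host/port to the broker's JMX settings.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi");
            try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbsc = jmxc.getMBeanServerConnection();
                // Match all queue MBeans, then keep only the internal
                // store-and-forward queues ($.artemis.internal.sf.* by default).
                ObjectName pattern = new ObjectName(
                        "org.apache.activemq.artemis:subcomponent=queues,*");
                Set<ObjectName> names = mbsc.queryNames(pattern, null);
                for (ObjectName on : names) {
                    if (on.getCanonicalName().contains("internal.sf")) {
                        System.out.println(on.getKeyProperty("queue")
                                + " consumers=" + mbsc.getAttribute(on, "ConsumerCount")
                                + " messages=" + mbsc.getAttribute(on, "MessageCount"));
                    }
                }
            }
        }
    }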

I am using a version built from master approx. two weeks ago.

BR,
- Ilkka

Re: Artemis 2.5.0 - Colocated scaledown cluster issues

Posted by Clebert Suconic <cl...@gmail.com>.
On Wed, May 2, 2018 at 3:01 AM, Ilkka Virolainen <Il...@bitwise.fi> wrote:
> Hello,
>
> In addition to a previously reported issue [1], I have some problems with my Artemis cluster. My setup [2] is a symmetric two-node cluster of colocated instances with scaledown. Besides the node restart leaving replication in a problematic state [1], there are other issues, namely:
>
> 1) After running for approximately two weeks, one of the nodes crashed due to heap space exhaustion. Heap dump analysis indicates that the cluster connection failed and millions of messages ended up in the internal store-and-forward queue, causing an eventual OOM exception - I guess the internal messages are not paged?

You can configure it to page...
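
A minimal sketch of such an address-setting: the match pattern assumes the default $.artemis.internal.sf naming prefix for the store-and-forward addresses, and the size limits are placeholders:

    <address-settings>
       <!-- Sketch: match the internal store-and-forward addresses, which are
            named $.artemis.internal.sf.<cluster-name>.<node-id> by default.
            The size limits below are placeholder values. -->
       <address-setting match="$.artemis.internal.sf.#">
          <max-size-bytes>104857600</max-size-bytes>
          <page-size-bytes>10485760</page-size-bytes>
          <address-full-policy>PAGE</address-full-policy>
       </address-setting>
    </address-settings>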

Also... on the cluster connection you can configure the max retries (reconnect-attempts) of the cluster connection...

I'm not talking about replication here... this is probably about another node that is still connected.
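
For reference, a sketch of where that setting lives in broker.xml; the cluster, connector, and discovery-group names are placeholders:

    <cluster-connections>
       <cluster-connection name="my-cluster">
          <!-- placeholder connector/discovery-group names -->
          <connector-ref>netty-connector</connector-ref>
          <retry-interval>500</retry-interval>
          <!-- -1 retries forever; a finite value lets the bridge give up
               and stop instead of reconnecting indefinitely -->
          <reconnect-attempts>5</reconnect-attempts>
          <message-load-balancing>ON_DEMAND</message-load-balancing>
          <max-hops>1</max-hops>
          <discovery-group-ref discovery-group-name="dg-group1"/>
       </cluster-connection>
    </cluster-connections>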

>
> 2) I have now run the cluster for ~2 weeks and it has ended up in a state where messages are redistributed from node 1 to node 2 BUT not the other way around. This may be the same issue as 1), but I cannot tell for sure. I tried setting the core server logging level to DEBUG on node 2 and sending messages to a test topic, but I get no references to the address name in the Artemis logs.

Check what I said above about reconnects on the cluster connection.



If you were using master... there's a way you can consume messages from the internal queue and send them manually using a producer/consumer. You will need to get a snapshot from master.
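
A rough sketch of that drain-and-resend approach using the core client API, assuming the broker allows a plain consumer on the internal queue. The queue name is a placeholder (the real one embeds the cluster name and the remote node's UUID), and whether a straight resend preserves all relevant message properties is something to verify:

    import org.apache.activemq.artemis.api.core.client.ActiveMQClient;
    import org.apache.activemq.artemis.api.core.client.ClientConsumer;
    import org.apache.activemq.artemis.api.core.client.ClientMessage;
    import org.apache.activemq.artemis.api.core.client.ClientProducer;
    import org.apache.activemq.artemis.api.core.client.ClientSession;
    import org.apache.activemq.artemis.api.core.client.ClientSessionFactory;
    import org.apache.activemq.artemis.api.core.client.ServerLocator;

    public class SnfDrain {
        public static void main(String[] args) throws Exception {
            // Placeholder: the real store-and-forward queue is named
            // $.artemis.internal.sf.<cluster-name>.<remote-node-uuid>
            String snfQueue = "$.artemis.internal.sf.my-cluster.REMOTE-NODE-UUID";

            try (ServerLocator locator = ActiveMQClient.createServerLocator("tcp://localhost:61616");
                 ClientSessionFactory factory = locator.createSessionFactory();
                 ClientSession session = factory.createSession("admin", "admin",
                         false, true, true, false, 0)) {
                ClientConsumer consumer = session.createConsumer(snfQueue);
                // Anonymous producer: the target address is passed on each send.
                ClientProducer producer = session.createProducer();
                session.start();

                ClientMessage msg;
                while ((msg = consumer.receive(2000)) != null) {
                    // Each message still carries its original destination address,
                    // so it can be re-sent directly to that address.
                    producer.send(msg.getAddress(), msg);
                    msg.acknowledge();
                }
            }
        }
    }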


>
> I realize that it's difficult to address these problems given the information at hand and the circumstances in which they occur: they (excluding the issue described in [1]) start to appear only after the cluster has been running for a long time, with no apparent cause and no easy way to reproduce them. I would, however, appreciate any tips for debugging this further or advice on where to look for a probable cause.
>
> - Ilkka
>
> [1] Backup voting issue: http://activemq.2283324.n4.nabble.com/Artemis-2-5-0-Problems-with-colocated-scaledown-td4737583.html#a4737808
> [2] Sample brokers: https://github.com/ilkkavi/activemq-artemis/tree/scaledown-issue/issues/IssueExample/src/main/resources/activemq



-- 
Clebert Suconic