Posted to users@activemq.apache.org by AntonR <an...@volvo.com> on 2020/04/17 10:33:38 UTC

Artemis cluster - Messages stuck in Delivering state

Hi,

I have an issue with the Artemis broker which I am having trouble solving
and also reproducing outside of my testing environment.

The setup is the following: 3 Artemis brokers running on separate servers,
clustered in an Active-Active fashion with static connectors

The clients are running JBoss 6 with the ActiveMQ 5 RA

Messages are processed in XA transactions with MDBs

All clients (16 of them, with multiple queues each and no topics) use one
separate RA per broker (both old and new versions tested) with the failover
protocol, priorityBackup=true and randomize=false. Each RA connects to
server 1, 2 or 3 and is set to fail over to the next broker in line if its
broker becomes unavailable. This is done in order to achieve both load
balancing and redundancy.
The environment is set up like this because it used to run with ActiveMQ 5
brokers as well, and this made sense at the time.
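
To give a rough idea, each broker.xml contains a static cluster connection
along these lines (host names, connector names and the load-balancing values
here are illustrative, not my exact config):

<!-- illustrative sketch of broker1's static-connector cluster setup -->
<connectors>
   <connector name="broker1-connector">tcp://broker1:61616</connector>
   <connector name="broker2-connector">tcp://broker2:61616</connector>
   <connector name="broker3-connector">tcp://broker3:61616</connector>
</connectors>
<cluster-connections>
   <cluster-connection name="my-cluster">
      <!-- this broker's own connector, announced to the other cluster members -->
      <connector-ref>broker1-connector</connector-ref>
      <message-load-balancing>ON_DEMAND</message-load-balancing>
      <max-hops>1</max-hops>
      <!-- the other two brokers, listed statically instead of via discovery -->
      <static-connectors>
         <connector-ref>broker2-connector</connector-ref>
         <connector-ref>broker3-connector</connector-ref>
      </static-connectors>
   </cluster-connection>
</cluster-connections>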

The problem I am seeing with the Artemis brokers is that after a
failover-failback scenario, i.e. a broker goes down and later comes back
up, messages get stuck in the "Delivering" state and the only way to get
them to roll back is to restart the broker. Even after a restart the
problem persists: the clients "prefetch" up to their limit again and
then stop.

There is no timeout happening; messages stay like this forever, and the only
way out of this state is to either restart the clients or to stop all Artemis
brokers, start an ActiveMQ 5 broker for ~10 seconds and then start the
Artemis brokers again. This happens on every broker restart, but not to all
clients at once, so I would guess it is some sort of timing issue.

I have tried changing every possible config I can think of without any
effect and have yet to be able to reproduce this issue outside of this
(legacy) test environment. I run Artemis in several other environments with
newer clients (mostly still ActiveMQ 5 clients, but without JBoss, MDBs
and XA) and have zero issues.

Some things I have noticed but have yet to piece together:

The connectionID for the consumer that holds the messages in "Delivering"
does not exist: in Hawtio I can trace the messages to a consumer, and that
consumer has a corresponding session, but the session does not have an
associated connection (there is a connectionID reported, but if I click on
it or search for it, it does not exist).

The DeliveringCount goes up to 1000 messages for each consumer, which is the
OpenWire default prefetch size, but most clients use prefetchPolicy.all=100,
which is otherwise respected.
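
For reference, the prefetch limit is set on the client side via the RA
connection URL, roughly like this (an illustrative URL, not the exact one in
use):

failover:(tcp://broker1:61616,tcp://broker2:61616,tcp://broker3:61616)?jms.prefetchPolicy.all=100&randomize=false&priorityBackup=true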

Artemis reports "Error during buffering operation", see attached file 
artemis_stacktrace.txt
<http://activemq.2283324.n4.nabble.com/file/t378961/artemis_stacktrace.txt>  

A thread dump on the clients reports that basically all JMS-related threads
are stuck at the same place, see attached file  client_threads.txt
<http://activemq.2283324.n4.nabble.com/file/t378961/client_threads.txt>  

Br,
Anton




Re: Artemis cluster - Messages stuck in Delivering state

Posted by AntonR <an...@volvo.com>.
I realize this might be difficult to answer, as I myself am unable to
reproduce the issue in a simplified environment... but this really has me
stumped, and it is currently the only thing keeping me from migrating over to
the Artemis broker from ActiveMQ.

Is there anything else you can think of that I can try to do or any other
info I might be able to provide to aid in troubleshooting?

Br,
Anton




Re: Artemis cluster - Messages stuck in Delivering state

Posted by Clebert Suconic <cl...@gmail.com>.
If you isolate the issue with a test case... perhaps someone would be
able to take a look.

On Fri, May 29, 2020 at 5:31 AM AntonR <an...@volvo.com> wrote:



-- 
Clebert Suconic

Re: Artemis cluster - Messages stuck in Delivering state

Posted by AntonR <an...@volvo.com>.
I understand, but that is not something I can implement, so that would mean
I am stuck with the old broker solution for all environments running these
specific application setups.

Further testing shows that the "initialRedeliveryDelay" setting does not
really work, by the way; messages still get stuck in the "Delivering" state
indefinitely, regardless of the configured delay. With this setting the
lockup does release after a broker restart, but a small number of messages,
on the order of 5-20, still get stuck.

Do you have any thoughts on what might be happening otherwise? 
I think the key points of the issue are:
1. Messages in XA transactions get stuck indefinitely in the "Delivering"
state after a broker failover using the OpenWire protocol (at least for days;
no timeout occurs that I can see, but maybe they roll back just to get stuck
again).
2. Hawtio reports that these queues have additional consumers connected to
them (with IDs belonging to the clients that hold the messages in Delivering),
but their corresponding sessions point to a non-existent connectionID.




Re: Artemis cluster - Messages stuck in Delivering state

Posted by Clebert Suconic <cl...@gmail.com>.
Well... for XA, there is a lot more test coverage with Artemis through
WildFly / the Core protocol. For XA/MDBs I highly recommend the Core protocol.

On Thu, May 28, 2020 at 7:33 PM AntonR <an...@volvo.com> wrote:



-- 
Clebert Suconic

Re: Artemis cluster - Messages stuck in Delivering state

Posted by AntonR <an...@volvo.com>.
Hi,

Basically, they all use that adapter because they are currently running
against the ActiveMQ 5 broker. I am in the process of moving towards the
Artemis broker in all environments because of its superior performance and
some much-needed additional features. As I have understood it, it should be
compatible with ActiveMQ 5 clients? It certainly works for all the other
clients I've tried it with...

Down the line some of these JBoss XA clients will be updated to use the
Core protocol instead, but some are getting decommissioned within a year or
two, and as such very little work besides maintenance is going into them.

The main reason, though, is that there are too many business-critical
systems running together, so coordinating such a change in one go is almost
impossible. My thought was that since the Artemis broker should be backwards
compatible with the current clients, they could be updated independently of
the broker over time.
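
(For what it is worth, the default Artemis acceptor already serves several
protocols on the same port, something along these lines in broker.xml, which
is why I was hoping the OpenWire clients and future Core clients could
coexist on the same brokers:)

<acceptor name="artemis">tcp://0.0.0.0:61616?protocols=CORE,AMQP,STOMP,HORNETQ,MQTT,OPENWIRE</acceptor>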

Br,
Anton




Re: Artemis cluster - Messages stuck in Delivering state

Posted by Clebert Suconic <cl...@gmail.com>.
So, why are you using the OpenWire adapter against Artemis?

As you're relying on XA, you should probably use the native adapter from
JBoss 6 towards Artemis using the Core protocol.


Or are you using the binaries without modifications?

On Fri, Apr 17, 2020 at 6:33 AM AntonR <an...@volvo.com> wrote:



-- 
Clebert Suconic

Re: Artemis cluster - Messages stuck in Delivering state

Posted by AntonR <an...@volvo.com>.
I just found this old thread and a Jira describing what seems to be the same
or a very similar issue, but I have been unable to replicate the issue with
the method described there:

http://activemq.2283324.n4.nabble.com/Artemis-all-messages-go-to-quot-Delivering-quot-after-a-client-crash-td4702940.html




Re: Artemis cluster - Messages stuck in Delivering state

Posted by AntonR <an...@volvo.com>.
I don't know if anyone is looking into this or has any ideas, but I have
made some new discoveries that might help in figuring out what is going on.

I still have not been able to replicate the issue in a smaller/more
controlled environment, even though pretty much everything is the same in
regard to broker configuration, application and client setup. I suspect it
might be caused in part by the number of clients in the real environment,
something I cannot really simulate.

What I have found, though, is two workarounds, neither of which is ideal,
but maybe they can give a hint to someone other than me.

Workaround 1: If I remove the failover nodes from the RA configuration the
problem does not appear, so that means the config is roughly:
RA1:
failover:(tcp://broker1:61616)?nested.soLinger=10&nested.soTimeout=200000&jms.rmIdFromConnectionId=true&maxReconnectAttempts=0
RA2:
failover:(tcp://broker2:61616)?nested.soLinger=10&nested.soTimeout=200000&jms.rmIdFromConnectionId=true&maxReconnectAttempts=0
And so on...

This eliminates the issue entirely, but at the cost of one RA and its
corresponding MDBs not failing over, and thus being unable to perform any
work for the duration of the broker downtime.

Workaround 2: If I set "initialReconnectDelay" to a value of 5000 or more,
this sort of fixes the issue.
Example of one RA connection URL:
failover:(tcp://broker1:61616,tcp://broker2:61616,tcp://broker3:61616)?nested.soLinger=10&nested.soTimeout=200000&jms.rmIdFromConnectionId=true&randomize=false&priorityBackup=true&maxReconnectAttempts=0&initialReconnectDelay=5000

This kind of works, but at least with a delay of 5000 I still get the lockup
every now and then, with the upside that an additional broker restart fixes
it. I do not want this setup in a production environment, but at least it
sort of works without any major impact on application performance.

Without much evidence to support it, I think the issue might be explained in
the ActiveMQ failover transport documentation
<https://activemq.apache.org/failover-transport-reference>. Under
Transactions they describe an issue that sounds somewhat similar to what I am
seeing, but the release notes for the fix version seem to be offline, so I
have been unable to track down the specific fix implemented. Perhaps it is
something that could be adopted in the Artemis broker as well?

Any thoughts on this? Or is there something inherently incompatible between
my setup and the Artemis broker?

Br,
Anton


