Posted to users@activemq.apache.org by sravan <sr...@gmail.com> on 2017/06/21 18:51:53 UTC

Messages are stuck in ActiveMQ 5.11 and delivered after more than 24 hours

Our batch processing applications process about 10 billion messages a day.
For the past two months we have been experiencing an issue where ActiveMQ
delivers messages very late, sometimes as much as 3 days late. On a typical
day about 1% of messages (5% in the worst case) are delivered a day late.
We do not have a message expiration policy that controls what should happen
when a message is not delivered within a certain period of time.

We monitored our Splunk-based logs and do not see any exceptions or errors
that indicate issues with consumer connections. We could not turn on
additional logging on ActiveMQ because that causes a huge performance hit,
so we mostly relied on monitoring the ActiveMQ consoles and Dynatrace.
There are no server resource utilization issues on the AMQ nodes. We have 4
AMQ nodes active in the cluster.

When we monitored the ActiveMQ consoles we sometimes saw messages stuck in
the network bridge. By "stuck" I mean that message draining was extremely
slow: over a period of 2 hours I saw only a handful of messages drained.
Whenever we restart an AMQ node, any stuck messages on that node are
released (I think this is a known fact to all AMQ users). What is most
frustrating is that the number of stuck messages we have noticed does not
correlate with the number of messages delivered late (i.e. delivered the
next day), so we are under the impression that there could be invisible
stuck messages.

This issue started after we applied a Linux patch
(redhat-release-5Server-5.11.0.9 / autofs-5.0.1-0.rc2.186.el5_11) on the
ActiveMQ nodes. We searched the forums for any incompatibility between AMQ
5.11 and this Linux patch but could not find anything. Does anyone here
have any ideas or suggestions for how we can troubleshoot this issue
further? Any input would be greatly appreciated.

--
View this message in context: http://activemq.2283324.n4.nabble.com/Messages-are-stuck-in-ActiveMQ-5-11-and-delivered-for-after-more-than-24-hours-tp4727694.html
Sent from the ActiveMQ - User mailing list archive at Nabble.com.

Re: Messages are stuck in ActiveMQ 5.11 and delivered after more than 24 hours

Posted by Tim Bain <tb...@alumni.duke.edu>.

I repeat my earlier suggestion that you should use a sampler on all brokers
to characterize where the time is being spent.

The fact that turning logging to debug causes the broker to experience a
slowdown almost immediately makes it seem like maybe there's an issue with
disk I/O or space, but that's just a guess and using a sampler will give
you something more concrete than a guess.

Also, you've characterized the problem as messages getting "stuck" in the
network connectors. When this happens, are individual messages truly stuck
(i.e. no messages are being passed), or is it simply that the rate they're
flowing out at is lower than the rate they're flowing in at (so there's a
net backup but individual messages are still being passed)? And when
messages are passed, do they arrive in order, or do they show up in an
order vastly different from the one in which they were sent?
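
One rough way to tell these cases apart is to take two readings of the
enqueue and dequeue counters a few minutes apart and compare the deltas.
Here is a minimal sketch in Java. The class, method, and verdict names are
made up for illustration, and it assumes the counter deltas and the pending
count have already been read from the broker by whatever means is available
(for example the console's destination statistics).

```java
public class DrainCheck {
    /** Hypothetical labels for the situations described above. */
    enum Verdict { STUCK, BACKING_UP, KEEPING_UP }

    /**
     * @param enqueuedDelta messages that entered between the two readings
     * @param dequeuedDelta messages that left between the two readings
     * @param pending       messages still waiting at the second reading
     */
    static Verdict classify(long enqueuedDelta, long dequeuedDelta, long pending) {
        if (pending > 0 && dequeuedDelta == 0) {
            return Verdict.STUCK;       // messages are waiting and nothing is moving
        }
        if (dequeuedDelta < enqueuedDelta) {
            return Verdict.BACKING_UP;  // moving, but slower than the inflow
        }
        return Verdict.KEEPING_UP;      // outflow matches or exceeds inflow
    }

    public static void main(String[] args) {
        System.out.println(classify(500, 0, 1200));   // STUCK
        System.out.println(classify(500, 20, 1200));  // BACKING_UP
        System.out.println(classify(500, 650, 50));   // KEEPING_UP
    }
}
```

A "handful of messages in 2 hours" would come out as BACKING_UP rather than
STUCK here, which is a different failure mode from a bridge that has stopped
forwarding entirely.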

Tim

On Jul 11, 2017 4:29 PM, "sravan" <sr...@gmail.com> wrote:

> Unfortunately we are in a predicament: we are having trouble both
> reproducing the problem in our performance test environments and
> debugging it. The ActiveMQ INFO logs do not contain enough information to
> understand what's going on. When we turn on DEBUG logging (in a lower
> environment), ActiveMQ hangs within a few minutes and never comes back.
> So far the only clue we have is that, while monitoring the ActiveMQ
> consoles in our prod environment, we observed messages stuck on the
> network bridge for multiple hours. Our research also shows that the
> consumers are doing fine; the main issue is that ActiveMQ is delivering
> messages very late, and we simply don't know the root cause.
>
> When we worked with an ActiveMQ consultant a year ago, we were advised to
> scale ActiveMQ vertically rather than horizontally, i.e. to have 2 larger
> AMQ nodes rather than 4 smaller nodes in a cluster. We were told that
> this avoids many potential anomalies with the network bridge and with
> shipping messages across the nodes.
>
> Does anyone here have any suggestions for how else we could debug or fix
> our issue? Again, just to refresh your minds, the main issue we’ve been
> struggling with is that our AMQs are delivering messages extremely late,
> sometimes a day or even two days later. We do not see any JMS exceptions,
> and no exceptions or issues at the consumers' end. Any input is greatly
> appreciated.

Re: Messages are stuck in ActiveMQ 5.11 and delivered after more than 24 hours

Posted by sravan <sr...@gmail.com>.

Unfortunately we are in a predicament: we are having trouble both
reproducing the problem in our performance test environments and debugging
it. The ActiveMQ INFO logs do not contain enough information to understand
what's going on. When we turn on DEBUG logging (in a lower environment),
ActiveMQ hangs within a few minutes and never comes back. So far the only
clue we have is that, while monitoring the ActiveMQ consoles in our prod
environment, we observed messages stuck on the network bridge for multiple
hours. Our research also shows that the consumers are doing fine; the main
issue is that ActiveMQ is delivering messages very late, and we simply
don't know the root cause.

When we worked with an ActiveMQ consultant a year ago, we were advised to
scale ActiveMQ vertically rather than horizontally, i.e. to have 2 larger
AMQ nodes rather than 4 smaller nodes in a cluster. We were told that this
avoids many potential anomalies with the network bridge and with shipping
messages across the nodes.

Does anyone here have any suggestions for how else we could debug or fix
our issue? Again, just to refresh your minds, the main issue we’ve been
struggling with is that our AMQs are delivering messages extremely late,
sometimes a day or even two days later. We do not see any JMS exceptions,
and no exceptions or issues at the consumers' end. Any input is greatly
appreciated.

Re: Messages are stuck in ActiveMQ 5.11 and delivered after more than 24 hours

Posted by Tim Bain <tb...@alumni.duke.edu>.

The most effective way I know to determine what a Java process is doing
when I can't step through with a debugger is to use a CPU sampler
(JVisualVM ships with the Oracle JDK and can attach either locally or from
a remote machine via JMX and RMI) to capture where the time is being spent.
Let it capture data for a few minutes, then take a snapshot and dig into
what the various threads are spending their time on.

Don't use the profiler on an operational system! The sampler is what you
want, since it can give you good insight without measurably degrading
performance, whereas the profiler will grind everything to a halt on a
system as heavily loaded as yours sounds.
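
To make the sampler/profiler distinction concrete: a sampler just takes
periodic thread dumps and counts which frames keep showing up, without
instrumenting any code. The sketch below does that for the JVM it runs in,
using the standard java.lang.management API. It is only an illustration of
the idea, not a replacement for JVisualVM. The names StackSampler and
sample(), and the sample count and interval, are my own assumptions.
Sampling a remote broker would mean obtaining the ThreadMXBean over a JMX
connection instead, which is not shown here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

public class StackSampler {
    /**
     * Takes up to `samples` thread dumps, `intervalMillis` apart, and counts
     * how often each method is the top frame of a RUNNABLE thread.
     */
    static Map<String, Integer> sample(ThreadMXBean threads, int samples, long intervalMillis) {
        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            // false, false: skip lock details to keep each dump cheap
            for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
                StackTraceElement[] stack = info.getStackTrace();
                if (info.getThreadState() == Thread.State.RUNNABLE && stack.length > 0) {
                    String frame = stack[0].getClassName() + "." + stack[0].getMethodName();
                    hits.merge(frame, 1, Integer::sum);
                }
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop early
                break;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, Integer> hits = sample(ManagementFactory.getThreadMXBean(), 20, 100);
        // Print the ten most frequently seen top frames
        hits.entrySet().stream()
            .sorted((a, b) -> Integer.compare(b.getValue(), a.getValue()))
            .limit(10)
            .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
    }
}
```

Frames that dominate the counts are where the threads are spending their
time. Note that the JVM also reports threads blocked in native I/O calls as
RUNNABLE, so a disk or socket bottleneck would show up as read/write frames
at the top of the list.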

Tim

On Jun 21, 2017 1:08 PM, "sravan" <sr...@gmail.com> wrote:

> Our batch processing applications process about 10 billion messages a
> day. For the past two months we have been experiencing an issue where
> ActiveMQ delivers messages very late, sometimes as much as 3 days late.
> On a typical day about 1% of messages (5% in the worst case) are
> delivered a day late. We do not have a message expiration policy that
> controls what should happen when a message is not delivered within a
> certain period of time.
>
> We monitored our Splunk-based logs and do not see any exceptions or
> errors that indicate issues with consumer connections. We could not turn
> on additional logging on ActiveMQ because that causes a huge performance
> hit, so we mostly relied on monitoring the ActiveMQ consoles and
> Dynatrace. There are no server resource utilization issues on the AMQ
> nodes. We have 4 AMQ nodes active in the cluster.
>
> When we monitored the ActiveMQ consoles we sometimes saw messages stuck
> in the network bridge. By "stuck" I mean that message draining was
> extremely slow: over a period of 2 hours I saw only a handful of messages
> drained. Whenever we restart an AMQ node, any stuck messages on that node
> are released (I think this is a known fact to all AMQ users). What is
> most frustrating is that the number of stuck messages we have noticed
> does not correlate with the number of messages delivered late (i.e.
> delivered the next day), so we are under the impression that there could
> be invisible stuck messages.
>
> This issue started after we applied a Linux patch
> (redhat-release-5Server-5.11.0.9 / autofs-5.0.1-0.rc2.186.el5_11) on the
> ActiveMQ nodes. We searched the forums for any incompatibility between
> AMQ 5.11 and this Linux patch but could not find anything. Does anyone
> here have any ideas or suggestions for how we can troubleshoot this issue
> further? Any input would be greatly appreciated.