Posted to users@activemq.apache.org by sl...@cecurity.com.INVALID on 2022/10/05 14:42:08 UTC

Slow consume issue (was: Re: AMQ142007: Can not find record during compact replay)

Hello,

Thank you for your response. The clients refused downtime for the store
dump at the time, so we ended up doing a live cleanup of the queue that
they couldn't consume fast enough (it was holding ~20M response messages).

The occurrence of that specific WARNING diminished significantly after
that, and it hasn't been seen again in over a week.

We still have a suspected performance issue on that broker though: after
a few hours of uptime, consumption of that queue slows and messages pile up.

My other queues on the same broker are at zero most of the time for an
identical or similar volume over the same timeframe.

At some point we had

AMQ222144: Queue could not finish waiting executors. Try increasing the
threadpool size

which we did, by 50%; the warning disappeared, but the slowdown did not.

What is specific about the usage of that queue is that:

1/ consumers use selectors to pick their messages of interest

2/ one of the client applications creates and destroys a lot of
short-lived consumers (sketched below). There may be more than 400
consumers connected at some point.
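
For illustration, the pattern looks roughly like this on the client side
(plain JMS; the header name, queue and timeout are placeholders, not our
real code):

    import javax.jms.*;

    // Sketch of the short-lived selective consumer pattern described above.
    // A new consumer is created and closed for every expected response.
    public class ShortLivedSelectiveConsumer {
        public static Message fetchResponse(Session session, Queue responseQueue,
                                            String requestId) throws JMSException {
            // Placeholder header name; selects only the response for this request.
            String selector = "requestId LIKE '" + requestId + "%'";
            MessageConsumer consumer = session.createConsumer(responseQueue, selector);
            try {
                return consumer.receive(30_000); // wait up to 30 seconds for a match
            } finally {
                consumer.close(); // hundreds of these create/close cycles hit the same queue
            }
        }
    }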

Are there more specific documentation pointers available on dimensioning
broker resources depending on usage patterns?

Is there a known issue with selector performance on 2.24.0? In
particular, one that may degrade over time.

It happens even when the queue has not reached the paging threshold and
still fits entirely in memory, and no significant iowait has been spotted
on the server side.

Heap memory usage of the VM seems stable and well below the allocation.
Netty threads are not specified on the acceptor; the default seems to be
96, and most of them are in epoll wait.

Regards,


On 22/09/2022 at 10:47, Clebert Suconic wrote:
> I don’t think it’s critical.  But I wanted to check your data to be able to
> help.
> On Thu, Sep 22, 2022 at 4:26 AM Clebert Suconic <cl...@gmail.com>
> wrote:
>
>> Also, inspect the file before you upload it.
>>
>> On Thu, Sep 22, 2022 at 4:25 AM Clebert Suconic <cl...@gmail.com>
>> wrote:
>>
>>> Is there any chance you could add a print data of your journal to the
>>> JIRA?
>>>
>>>
>>> The Artemis data print has a --safe argument that obfuscates the data,
>>> leaving only the journal structure recorded.
>>>
>>>
>>> ./artemis data print --safe
>>>
>>>
>>> The system has to be stopped, or you could copy your journal to another
>>> server for this operation.
>>>
>>> On Tue, Sep 20, 2022 at 6:24 AM <sl...@cecurity.com.invalid> wrote:
>>>
>>>> Hello,
>>>>
>>>> This is related to my ticket
>>>> https://issues.apache.org/jira/browse/ARTEMIS-3992
>>>>
>>>> We still have occasional spurts of messages like
>>>>
>>>> 2022-09-20 10:32:43,913 WARN  [org.apache.activemq.artemis.journal]
>>>> AMQ142007: Can not find record 268 566 334 during compact replay
>>>> 2022-09-20 10:32:43,913 WARN  [org.apache.activemq.artemis.journal]
>>>> AMQ142007: Can not find record 268 568 921 during compact replay
>>>> 2022-09-20 10:32:43,913 WARN  [org.apache.activemq.artemis.journal]
>>>> AMQ142007: Can not find record 268 567 786 during compact replay
>>>> 2022-09-20 10:32:43,913 WARN  [org.apache.activemq.artemis.journal]
>>>> AMQ142007: Can not find record 268 569 685 during compact replay
>>>> 2022-09-20 10:32:43,913 WARN  [org.apache.activemq.artemis.journal]
>>>> AMQ142007: Can not find record 268 569 347 during compact replay
>>>> 2022-09-20 10:32:43,913 WARN  [org.apache.activemq.artemis.journal]
>>>> AMQ142007: Can not find record 268 566 991 during compact replay
>>>>
>>>> and some consumers are dead slow (~4 msg/s). It may be their fault, but
>>>> they struggle to find a client-side cause. The queue they are querying
>>>> now holds 14M messages.
>>>>
>>>> I'm trying to determine
>>>>
>>>> a/ the severity of the problem exposed by those warnings. Can the store
>>>> be corrupt? (We don't really have doubts about the physical layer; this is
>>>> local storage on the server, with no errors reported by the OS.)
>>>>
>>>> b/ is there a possibility that these issues play a role in the
>>>> performance problem?
>>>>
>>>> Still on 2.24.0, as described in the JIRA issue.
>>>>
>>>> Any ideas?
>>>>
>>>> Regards,
>>>>
>>>> SL
>>>>
>>> --
>>> Clebert Suconic
>>>
>> --
>> Clebert Suconic
>>

RE: Slow consume issue (was: Re: AMQ142007: Can not find record during compact replay)

Posted by sl...@cecurity.com.INVALID.
Thank you for your response.
Since it seems from previous messages on this list that Slack is the most
usual way, maybe send an invite?
I will try to gather the elements; some of them I may not be able to post
on an open site/mailing list.
Thanks.

PS: I'm in the CEST timezone.


Re: Slow consume issue (was: Re: AMQ142007: Can not find record during compact replay)

Posted by Justin Bertram <jb...@apache.org>.
I'd love to get a look at a few full thread dumps during the slow-down if
possible. Could you upload those somewhere or toss them on pastebin [1] or
gist [2], etc.?

Given that there are no consumers with selectors on the queues that have
no problems, and that the problem comes and goes along with the
application spawning lots of consumers, I'd say that either the queue
scanning related to the selectors is the problem (as I described in my
previous email) or the thrashing from the consumer creation/deletion
itself is causing excessive contention on the queue.

> Is it possible that the open/close of lots of consumers "wears down" the
> broker after some time?

It's not really about "wearing down" the broker. Creating and closing lots
of consumers on the same queue in rapid succession is just objectively bad
for performance because it creates contention on the queue (which, as
you've seen, impacts all consumers on the queue). I'd say this is an
anti-pattern. Is there a specific reason the application does this?

In your previous email you indicated that there may have been 400 or more
consumers on the same queue. Architecturally speaking this isn't an optimal
design. In general, you'd want to spread your consumers across many
different queues to take advantage of concurrency within the broker rather
than creating a bottleneck by piling so many consumers onto a single queue.
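
For what it's worth, here's a rough sketch of the shape I mean (the queue
naming, header, and client wiring are hypothetical): each client
application keeps one long-lived consumer on its own reply queue instead
of repeatedly creating selective consumers on a single shared queue.

    import javax.jms.*;

    // Sketch only: one long-lived consumer per client on a dedicated reply queue.
    public class ReplyQueueListener {
        public static void start(Connection connection, String clientName) throws JMSException {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            // Hypothetical naming scheme: one reply queue per client application.
            Queue replyQueue = session.createQueue("responses." + clientName);
            MessageConsumer consumer = session.createConsumer(replyQueue);
            consumer.setMessageListener(message -> {
                try {
                    // Correlate with the pending request in application code.
                    String requestId = message.getStringProperty("requestId"); // hypothetical header
                    // ... hand off to the matching waiter ...
                } catch (JMSException e) {
                    // handle/log the failure
                }
            });
            connection.start();
        }
    }

This spreads the work across queues and avoids the constant add/remove of
consumers on a single queue.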


Justin

[1] https://pastebin.com/index.php
[2] https://gist.github.com/


Re: Slow consume issue (was: Re: AMQ142007: Can not find record during compact replay)

Posted by sl...@cecurity.com.INVALID.
Thank you for your response.

- No selectors on other queues

- I checked the thread list on the Artemis console. At the time, 2 or 3
threads were in a BLOCKED state, for example with this stack:

 1. org.apache.activemq.artemis.core.server.impl.QueueImpl.addConsumer (QueueImpl.java:1414)
 2. org.apache.activemq.artemis.core.server.impl.ServerConsumerImpl.<init> (ServerConsumerImpl.java:258)
 3. org.apache.activemq.artemis.core.server.impl.ServerSessionImpl.createConsumer (ServerSessionImpl.java:572)
 4. org.apache.activemq.artemis.core.protocol.openwire.amq.AMQConsumer.init (AMQConsumer.java:156)
 5. org.apache.activemq.artemis.core.protocol.openwire.amq.AMQSession.createConsumer (AMQSession.java:190)
 6. org.apache.activemq.artemis.core.protocol.openwire.OpenWireConnection.addConsumer (OpenWireConnection.java:952)
 7. org.apache.activemq.artemis.core.protocol.openwire.OpenWireConnection$CommandProcessor.processAddConsumer (OpenWireConnection.java:1258)
 8. org.apache.activemq.command.ConsumerInfo.visit (ConsumerInfo.java:352)

The others were all RUNNABLE, WAITING, or TIMED_WAITING.

- The selectors are not complex IMHO: they check a specific header for a
prefix (<header> LIKE '<prefix>%'; a small example follows after this
list), and there are no messages that match none of the selectors.

- The distribution seems balanced; there is no significant gap for one
type of consumer.

- There are a few CPU spikes, but the server is almost always at least
85-90% idle. The broker seems to stay at 100-150% CPU, but there are 32
(logical) cores.
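
To make the selector shape concrete, it is roughly the following (the
header name and prefix value are placeholders; the real ones are
application specific):

    import javax.jms.*;

    // Placeholder header name and prefix; mirrors the "<header> LIKE '<prefix>%'" shape.
    public class SelectorExample {
        public static MessageConsumer openConsumer(Session session, Queue queue) throws JMSException {
            String selector = "docType LIKE 'INVOICE%'";
            return session.createConsumer(queue, selector);
        }
    }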


A couple of follow-up findings:

- After a while, the log statements came back despite the thread pool
increase:

AMQ222172 Queue <redacted> was busy for more than 10 000 milliseconds.
There are possibly consumers hanging on a network operation

[org.apache.activemq.artemis.core.server] AMQ222144: Queue could not
finish waiting executors. Try increasing the thread pool size

This was several hours after it started to slow down, and at a time when
consumption was totally floored (0 or a really small number of messages
per minute).

- At that point, the client applications spawning lots of consumers were
stopped. The other one, which uses a more common pattern with a static
pool of 10 consumers, came back to life and happily consumed its own
messages at its usual rate.

When the other client component was restarted, consumption floored again
and the thread pool warning came back from time to time.

Is it possible that the open/close of lots of consumers "wears down" the
broker after some time?

Side note: the most recent client application, with the static pool of
consumers, uses the Core protocol; the legacy one spawning lots of
consumers uses OpenWire with the classic ActiveMQ client library.

We did not identify this pattern when we were on 2.16 (which we moved
away from because rollback on timeout of some XA transactions led to a
possibly corrupted store and general instability, though over a longer
timeframe).

Regards,


Re: Slow consume issue (was: Re: AMQ142007: Can not find record during compact replay)

Posted by Justin Bertram <jb...@apache.org>.
> Is there a known issue with selector performance on 2.24.0? In
> particular, one that may degrade over time.

I'm not aware of any particular issue, per se, but generally speaking,
using selectors on queue consumers is not good for performance. This is due
to the queue scanning required to match messages to the selector.
Performance won't degrade over time necessarily, but it will degrade as the
complexity of the selector increases or as the number of messages increases
or as the number of consumers increases (or any combination thereof).

Couple of questions:
 - Are selectors in use with the same volume of messages and consumers on
the queues where you're not seeing any problems?
 - Have you gathered any thread dumps to see what the broker is doing
during these slow downs? If so, what did you find? If not, could you gather
a handful of thread dumps so we can know what's happening over time?
 - Have you gathered any metrics related to CPU utilization during these
slow downs? If so, what did you find?
 - Are there any "gaps" in the consumers' selectors such that some messages
may not actually match any selector and therefore will not be consumed?
 - How complex are your selectors?

Generally speaking, performance will be better with *topic* subscribers
using selectors. This is because the broker pre-sorts matching messages
into the corresponding subscription as they arrive at the topic which means
that the matching logic is performed less often than with the queue
use-case.
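
For example, with the classic JMS API a selective durable topic
subscription looks something like this (the client ID, topic name, and
selector are made up for the example):

    import javax.jms.*;

    // Sketch: a durable topic subscription with a selector. The broker sorts
    // matching messages into the subscription as they arrive, so matching
    // happens once per message instead of via repeated queue scans.
    public class SelectiveTopicSubscriber {
        public static Message receiveOne(Connection connection) throws JMSException {
            connection.setClientID("client-a");             // required for a durable subscription
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Topic topic = session.createTopic("responses"); // hypothetical topic name
            TopicSubscriber subscriber = session.createDurableSubscriber(
                    topic, "client-a-sub", "docType LIKE 'INVOICE%'", false);
            connection.start();
            return subscriber.receive(30_000);              // wait up to 30 seconds for a match
        }
    }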

I can't say for sure this is your problem. You may be hitting some other
issue. Hopefully your answers to my questions will help narrow it down.

FWIW, ActiveMQ Artemis 2.26.0 contains some performance optimizations for a
few uncommonly complex selector use-cases [1].


Justin

[1] https://issues.apache.org/jira/browse/ARTEMIS-3962
