You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@activemq.apache.org by "Anders Wallgren (Jira)" <ji...@apache.org> on 2021/11/17 17:05:00 UTC

[jira] [Commented] (ARTEMIS-3583) Race condition in QueueImpl#expireReferences(java.lang.Runnable) can cause spurious AMQ224013 warnings to be emitted.

    [ https://issues.apache.org/jira/browse/ARTEMIS-3583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445373#comment-17445373 ] 

Anders Wallgren commented on ARTEMIS-3583:
------------------------------------------

[~clebert.suconic@jboss.com] can you confirm my analysis above, or am I missing something here?

> Race condition in QueueImpl#expireReferences(java.lang.Runnable) can cause spurious AMQ224013 warnings to be emitted.
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARTEMIS-3583
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3583
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 2.18.0
>         Environment: Large-core machines running embedded ActiveMQ broker.
>            Reporter: Anders Wallgren
>            Priority: Major
>
> A recent change - [https://github.com/apache/activemq-artemis/pull/3654] - appears to introduce a race condition in QueueImpl#expireReferences(Runnable) that can cause the Runnable "done" parameter to never be invoked. This manifests itself as spurious "AMQ224013: failed to expire messages for queue" messages.
> The problem is this section of code and the non-atomic test-and-set on the expiryScanner.scannerRunning AtomicInteger:
> {code:java}
> if (!queueDestroyed && expiryScanner.scannerRunning.get() == 0) {  <--- "test"
>   if (expiryScanner.scannerRunning.incrementAndGet() == 1) { <--- "and-set"
>     expiryScanner.doneCallback = done;
>   }
>   getExecutor().execute(expiryScanner);
> } else {
>   // expire is already happening on this queue, move on!
>   if (done != null) {
>     done.run();
> }
> {code}
> Consider the following sequence of events and see if you agree with my analysis:
>  # Two threads, t1 and t2
>  # t1 runs PostOfficeImpl.ExpiryReaper which calls QueueImpl.expireReferences(latch::countDown) to drop a latch when expiration is "done". It waits 10 seconds for the latch to drop. If the latch doesn't drop within the timeout it issues a AMQ224013 warning.
>  # t2 calls QueueImpl.depage which calls QueueImpl.expireReferences(), which in turn calls QueueImpl.expireReferences(null)
>  # t1 and t2 both see that expiryScanner.scannerRunning.get() == 0 so they both enter that branch.
>  # t2 successfully increments expiryScanner.scannerRunning to 1 and sets expiryScanner.doneCallback = null (since it passed null as the "done" parameter) and then continues to run expiryScanner.
>  # t1 doesn't increment expiryScanner.scannerRunning and at this point the "done" argument is lost and will never be invoked, guaranteeing the AMQ224013 warning. t1 then also runs the expiryScanner (which has already been submitted once by t2)
> Because of the non-atomic test-and-set, the callback is not guaranteed to be invoked and, also, more than one expiry run will execute simultaneously (this seems to be undesirable based on why this change was made in the first place).
> Should the code not be something like this (I'm not 100% familiar with the semantics of scannerRunning being > 1, so this may not be the correct solution):
>  
> {code:java}
> if (!queueDestroyed && expiryScanner.scannerRunning.compareAndSet(0, 1)) { <--- test-and-set
>   expiryScanner.doneCallback = done;
>   getExecutor().execute(expiryScanner);
> } else {
>   // expire is already happening on this queue, move on!
>   if (done != null) {
>     done.run();
> }
> {code}
>  
> This guarantees that only one thread can enter the section that sets expiryScanner.doneCallback and submits expiryScanner for execution; all other threads will immediately have their Runnable invoked. 
> We have seen these AMQ224013 warnings while trying to upgrade to 2.18.0 while testing on very large instances (64 CPUs and up) that are very busy.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)