You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@activemq.apache.org by "Bruce Dodson (Jira)" <ji...@apache.org> on 2021/08/13 21:07:00 UTC

[jira] [Commented] (AMQNET-656) AMQP failover implementation fails to reconnect in some cases

    [ https://issues.apache.org/jira/browse/AMQNET-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398916#comment-17398916 ] 

Bruce Dodson commented on AMQNET-656:
-------------------------------------

In follow-up, as per discussion in the PR, it appears the existing code does not violate rules for locks with async code after all, and could not be the explanation of the stall I observed. Thus, in theory, the Task.Run wrapper should not be needed. At best it is recognized as a brute force workaround that may or may not have made any difference.

I also noted that a bug in the underlying AMQPNetLite, now fixed as of version 2.4.3, has been identified as the probable cause of the stall, due to dropped events to trigger failover / recovery: [Azure/amqpnetlite#460|https://github.com/Azure/amqpnetlite/issues/460] "NotifyClosed not being called on connection loss detected by HeartBeat..."

If so, the issue should no longer occur once 1.8.2 is released, as it will then reference AMQPNetLite 2.4.3+. Likewise, in theory, even before 1.8.2 is released, a workaround could be to reference AMQPNetLite 2.4.3 explicitly from NuGet.

I was not able to construct a test that would consistently reproduce the original issue. However, I will be switching one of our long-running test environments back from openwire to AMQP, with AMQPNetLite added as an explicit dependency and updated to 2.4.3, and hopefully will not see the stall again.

 

> AMQP failover implementation fails to reconnect in some cases
> -------------------------------------------------------------
>
>                 Key: AMQNET-656
>                 URL: https://issues.apache.org/jira/browse/AMQNET-656
>             Project: ActiveMQ .Net
>          Issue Type: Bug
>          Components: AMQP
>    Affects Versions: AMQP-1.8.1
>            Reporter: Bruce Dodson
>            Priority: Major
>          Time Spent: 8h
>  Remaining Estimate: 0h
>
> We recently had an issue where some of our producer instances were able to reconnect immediately after the master/slave (or primary/standby) broker cluster failed over, while others never reconnected.
> It appears to be related to two existing JIRAs, AMQNET-624 (addressed in GitHub PR#45) and AMQNET-626 (new issue raised in the same PR, but closed without any changes).
> Regarding the bug identified and fixed in AMQNET-624, part of the original solution was pulled back: where TriggerReconnectionAttempt was called via Task.Run, to instead call it directly. The second issue, AMQNET-626 was to raise concern about the unawaited task returned by TriggerReconnectionAttempt.
> I think perhaps calling from Task.Run may have been beneficial after all: it ensured that TriggerReconnectionAttempt was running on a thread from the thread pool. Otherwise, when TriggerReconnectionAttempt calls ScheduleReconnect, and ScheduleReconnect does an await, that is occurring from within a lock statement.
> As noted in MSDN, an await statement _cannot_ occur inside of a lock statement. That includes anywhere in the call stack, as far as I understand. If you do, it is not caught by the compiler, but can lead to failures e.g. where the task being awaited never gets scheduled.
> Invoking TriggerReconnectionAttempt from a thread pool thread (or another background thread) is one way to avoid this issue, and using Task.Run() might be the easiest way, even though it may also raise eyebrows. Any performance overhead of Task.Run() shouldn't be a factor, since it is only invoked upon losing connection, not continuously.
> The call to Task.Run() could also be moved into TriggerReconnectionAttempt, like so:
> {code:java}
> // this is invoked using Task.Run, to ensure it runs on a thread pool thread
> // in case this was invoked from inside a lock statement (which it is)
> return Task.Run(async () => await reconnectControl.ScheduleReconnect(Reconnect));{code}
> It does still leave the issue identified in AMQNET-626, that the result is not checked, but it resolves the failover failure caused by calling await inside of a lock.
> (Another way around this would be to use a SemaphoreSlim, or other async-compatible synchronization mechanism instead of a lock statement. However, that could have far-reaching implications, since lock statements are used in many parts of the AMQP implementation.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)