Posted to dev@activemq.apache.org by "Timothy Bish (JIRA)" <ji...@apache.org> on 2012/11/05 23:56:12 UTC

[jira] [Resolved] (AMQ-3993) NetworkBridge sometimes stops trying to reconnect after connection is lost

     [ https://issues.apache.org/jira/browse/AMQ-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timothy Bish resolved AMQ-3993.
-------------------------------

    Resolution: Fixed
      Assignee: Timothy Bish

This bit should be fixed by AMQ-4159 and AMQ-4160.
                
> NetworkBridge sometimes stops trying to reconnect after connection is lost
> --------------------------------------------------------------------------
>
>                 Key: AMQ-3993
>                 URL: https://issues.apache.org/jira/browse/AMQ-3993
>             Project: ActiveMQ
>          Issue Type: Bug
>    Affects Versions: 5.6.0
>         Environment: using static:// networkConnector (i.e. SimpleDiscoveryAgent)
>            Reporter: Ron Koerner
>            Assignee: Timothy Bish
>             Fix For: 5.8.0
>
>         Attachments: reconnect-problem-annotated.txt
>
>
> After losing the connection due to a shutdown of the peer, the broker tries to rebuild the connection once, fails again, and stops trying afterwards.
> While this also happens with a standard setup, it seems to happen much more often with a certain type of firewall which always accepts a connection, but closes it if the real destination cannot be reached.
> This can be simulated by using a "socat" forwarder between the two brokers.
> The problem seems to lie in the following sequence of events, a race condition and the use of {{event.failed}} in {{SimpleDiscoveryAgent.serviceFailed}} and {{bridges}} in {{DiscoveryNetworkConnector}}:
> # connection "failure" due to ShutdownInfo
> #- event.failed=true
> #- bridge is unregistered
> # start establishing a new connection
> #- event.failed=false
> #- bridge is not yet registered
> # second connection failure of the old connection due to EOF
> #- not blocked, since event.failed==false
> #- event.failed=true
> #- bridge would be unregistered, but currently there is none
> #- wait one second (continued below)
> # new connection is started
> #- bridge is registered
> # receive multiple connection failures of the new connection
> #- all blocked, since event.failed=true
> # continue after one second, try to establish a new connection
> #- blocked, since bridge is already registered
> To fix this problem, a NetworkBridge should probably not be allowed to call {{SimpleDiscoveryAgent.serviceFailed}} more than once, since {{event.failed}} cannot keep track of multiple connections at one time.
> The chain of events holds a lot of race conditions. If the second failure of the old connection occurs before the new connection is started (which seems to be the case most of the time) or the new connection's bridge is registered before the EOF occurs, the problem does not manifest.
> Attached is a log excerpt with my comments about the state of event.failed and bridges.
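
The sequence of events quoted above can be replayed deterministically in a small model. This is only a sketch and not ActiveMQ source code: the class ReconnectRaceSketch, its methods, the Bridge helper and the peer URI are all hypothetical names, and the single flag and the set merely stand in for event.failed and bridges as the reporter describes them. The guardEnabled switch models the suggestion that a bridge should report a failure at most once; it is not necessarily how AMQ-4159 and AMQ-4160 address the issue.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical model of the reported sequence; not ActiveMQ code. */
class ReconnectRaceSketch {
    static final String PEER = "tcp://peer:61616"; // assumed URI, for illustration only

    // Stand-in for the single event.failed flag kept for the peer.
    final AtomicBoolean failed = new AtomicBoolean(false);
    // Stand-in for the bridges registered in DiscoveryNetworkConnector.
    final Set<String> bridges = new HashSet<>();

    /** One modelled bridge, optionally limited to a single failure report. */
    final class Bridge {
        final AtomicBoolean reported = new AtomicBoolean(false);
        final boolean guardEnabled;

        Bridge(boolean guardEnabled) { this.guardEnabled = guardEnabled; }

        /** Returns true if the failure was accepted, i.e. a retry would be scheduled. */
        boolean reportFailure() {
            if (guardEnabled && !reported.compareAndSet(false, true)) {
                return false; // this bridge has already reported once
            }
            return serviceFailed();
        }
    }

    /** Models serviceFailed: only the first report while failed == false gets through. */
    boolean serviceFailed() {
        if (!failed.compareAndSet(false, true)) {
            return false; // blocked, since event.failed is already true
        }
        bridges.remove(PEER); // unregister the bridge, if there is one
        return true;
    }

    /** Models the start of a connection attempt; blocked while a bridge is registered. */
    boolean beginConnect() {
        if (bridges.contains(PEER)) {
            return false;
        }
        failed.set(false);
        return true;
    }

    /** Replays steps 1 to 6; returns true if the final retry is allowed to connect. */
    static boolean replay(boolean guardEnabled) {
        ReconnectRaceSketch s = new ReconnectRaceSketch();
        s.bridges.add(PEER);                              // initial, working bridge
        Bridge oldBridge = s.new Bridge(guardEnabled);

        oldBridge.reportFailure();                        // 1: ShutdownInfo
        s.beginConnect();                                 // 2: failed=false, no bridge yet
        boolean retryPending = oldBridge.reportFailure(); // 3: EOF on the old connection

        Bridge newBridge = s.new Bridge(guardEnabled);
        s.bridges.add(PEER);                              // 4: new bridge is registered
        for (int i = 0; i < 3; i++) {                     // 5: failures of the new connection
            retryPending |= newBridge.reportFailure();
        }
        return retryPending && s.beginConnect();          // 6: retry after the one-second wait
    }
}
```

In this model, ReconnectRaceSketch.replay(false) returns false: the retry scheduled in step 3 is blocked by the bridge registered in step 4, and the failures in step 5 never schedule another one. ReconnectRaceSketch.replay(true) returns true, because the old bridge's EOF is suppressed and the first failure of the new bridge is the one that unregisters it and schedules the retry.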

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira