You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@activemq.apache.org by "Stirling Chow (JIRA)" <ji...@apache.org> on 2012/11/04 20:20:12 UTC

[jira] [Created] (AMQ-4160) DiscoveryNetworkConnector can loss track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges

Stirling Chow created AMQ-4160:
----------------------------------

             Summary: DiscoveryNetworkConnector can loss track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges
                 Key: AMQ-4160
                 URL: https://issues.apache.org/jira/browse/AMQ-4160
             Project: ActiveMQ
          Issue Type: Bug
            Reporter: Stirling Chow
            Priority: Critical
             Fix For: 5.8.0


Symptom
=======
{{DiscoveryNetworkConnector}} is not thread-safe, and as a result, race conditions allow the {{bridges}} data structure to become corrupt and not accurately represent the bridges that exist.

Because {{bridges}} is used to control whether a discovery event will result in a bridge creation attempt, if it is not accurate, two results are possible:

# A discovery event will be ignored even though its corresponding bridge is not active; as a result, the bridge will never be established
# A discovery event will be processed even though its corresponding bridge is active; as a result, the bridge creation will fail (because of duplicate connections), and be indefinitely retried, generating many WARN/ERROR log messages

Cause
=====
{{DiscoveryNetworkConnector}} updates a {{bridges}} hashmap whenever a bridge is created or removed:

{code:title=DiscoveryNetworkConnector.java}
public void onServiceAdd(DiscoveryEvent event) {
...
        // Should we try to connect to that URI?
        synchronized (bridges) {
            if( bridges.containsKey(uri) ) {
                if (LOG.isDebugEnabled()) {
                    LOG.debug("Discovery agent generated a duplicate onServiceAdd event for: "+uri );
                }
                return;
            }
        }
...
        NetworkBridge bridge = createBridge(localTransport, remoteTransport, event);
        try {
            bridge.start();
            synchronized (bridges) {
                bridges.put(uri, bridge);
            }
...
}


public void onServiceRemove(DiscoveryEvent event) {
    String url = event.getServiceName();
    if (url != null) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            LOG.warn("Could not connect to remote URI: " + url + " due to bad URI syntax: " + e, e);
            return;
        }

        synchronized (bridges) {
            bridges.remove(uri);
        }
    }
}
{code}

There are a number of problems:

# The check-and-set for adding an entry {{bridges}} is not atomic.  As a result, concurrent attempts to add a bridge to the same URL can be allowed to proceed to bridge creation.  Only one of the bridges will be established (the other will fail when it attempts to add duplicate connections); however, the failure of the second bridge will result in a call to {{onServiceRemove(...)}} that will remove the single/shared {{bridges}} entry.
# The bridge is started before an entry is added to {{bridges}}.  Since bridges are multi-threaded, it is possible that an exception will be handled by a different thread at some time between {{bridge.start()}} and {{bridges.put(uri, bridge}}.  In processing this exception, the thread will call {{onServiceRemove(...}}, which will remove the (non-existent) {{bridges}} entry.  Subseqently, {{bridges.put(uri, bridge)}} is executed, and adds the entry to {{bridges}} even though it is now stale and represents a failed bridge.  Subsequent attempts to restore the bridge will be ignored, and no active bridge will be created.

The lack of thread-safety in {{DiscoveryNetworkConnector}} is exacerbated by AMQ-4159, which can result in many concurrent attempts to establish a bridge to the same URL.  AMQ-4159 also described how multiple calls can be made to {{onServiceRemove(...)}}, which can result in mal-behaviour similar to the second case described above (i.e., unexpected removal of a bridge that is active).

Solution
========
The attached patch attempts to restore some thread-safety to {{DiscoveryNetworkConnector}} by making the check-and-set atomic and adding the entry to {{bridges}} prior to starting the bridge.  An additional structure, {{activeEvents}}, keeps track of the event that represents the current attempt to establish a bridge to a given remote URL --- this is used to prevent multiple calls to {{onServiceRemove(...)}} from removing a {{bridges}} entry that was *not* added by the corresponding discovery event.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (AMQ-4160) DiscoveryNetworkConnector can lose track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges

Posted by "Stirling Chow (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AMQ-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stirling Chow updated AMQ-4160:
-------------------------------

    Attachment: AMQ4160_2.patch

Revised patch to address handleStop behaviour.
                
> DiscoveryNetworkConnector can lose track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AMQ-4160
>                 URL: https://issues.apache.org/jira/browse/AMQ-4160
>             Project: ActiveMQ
>          Issue Type: Bug
>            Reporter: Stirling Chow
>            Assignee: Timothy Bish
>            Priority: Critical
>             Fix For: 5.8.0
>
>         Attachments: AMQ4160_2.patch, AMQ4160.patch, AMQ4160Test.java
>
>
> Symptom
> =======
> {{DiscoveryNetworkConnector}} is not thread-safe, and as a result, race conditions allow the {{bridges}} data structure to become corrupt and not accurately represent the bridges that exist.
> Because {{bridges}} is used to control whether a discovery event will result in a bridge creation attempt, if it is not accurate, two results are possible:
> # A discovery event will be ignored even though its corresponding bridge is not active; as a result, the bridge will never be established
> # A discovery event will be processed even though its corresponding bridge is active; as a result, the bridge creation will fail (because of duplicate connections), and be indefinitely retried, generating many WARN/ERROR log messages
> Cause
> =====
> {{DiscoveryNetworkConnector}} updates a {{bridges}} hashmap whenever a bridge is created or removed:
> {code:title=DiscoveryNetworkConnector.java}
> public void onServiceAdd(DiscoveryEvent event) {
> ...
>         // Should we try to connect to that URI?
>         synchronized (bridges) {
>             if( bridges.containsKey(uri) ) {
>                 if (LOG.isDebugEnabled()) {
>                     LOG.debug("Discovery agent generated a duplicate onServiceAdd event for: "+uri );
>                 }
>                 return;
>             }
>         }
> ...
>         NetworkBridge bridge = createBridge(localTransport, remoteTransport, event);
>         try {
>             bridge.start();
>             synchronized (bridges) {
>                 bridges.put(uri, bridge);
>             }
> ...
> }
> public void onServiceRemove(DiscoveryEvent event) {
>     String url = event.getServiceName();
>     if (url != null) {
>         URI uri;
>         try {
>             uri = new URI(url);
>         } catch (URISyntaxException e) {
>             LOG.warn("Could not connect to remote URI: " + url + " due to bad URI syntax: " + e, e);
>             return;
>         }
>         synchronized (bridges) {
>             bridges.remove(uri);
>         }
>     }
> }
> {code}
> There are a number of problems:
> # The check-and-set for adding an entry {{bridges}} is not atomic.  As a result, concurrent attempts to add a bridge to the same URL can be allowed to proceed to bridge creation.  Only one of the bridges will be established (the other will fail when it attempts to add duplicate connections); however, the failure of the second bridge will result in a call to {{onServiceRemove(...)}} that will remove the single/shared {{bridges}} entry.
> # The bridge is started before an entry is added to {{bridges}}.  Since bridges are multi-threaded, it is possible that an exception will be handled by a different thread at some time between {{bridge.start()}} and {{bridges.put(uri, bridge}}.  In processing this exception, the thread will call {{onServiceRemove(...}}, which will remove the (non-existent) {{bridges}} entry.  Subseqently, {{bridges.put(uri, bridge)}} is executed, and adds the entry to {{bridges}} even though it is now stale and represents a failed bridge.  Subsequent attempts to restore the bridge will be ignored, and no active bridge will be created.
> The lack of thread-safety in {{DiscoveryNetworkConnector}} is exacerbated by AMQ-4159, which can result in many concurrent attempts to establish a bridge to the same URL.  AMQ-4159 also described how multiple calls can be made to {{onServiceRemove(...)}}, which can result in mal-behaviour similar to the second case described above (i.e., unexpected removal of a bridge that is active).
> Solution
> ========
> The attached patch attempts to restore some thread-safety to {{DiscoveryNetworkConnector}} by making the check-and-set atomic and adding the entry to {{bridges}} prior to starting the bridge.  An additional structure, {{activeEvents}}, keeps track of the event that represents the current attempt to establish a bridge to a given remote URL --- this is used to prevent multiple calls to {{onServiceRemove(...)}} from removing a {{bridges}} entry that was *not* added by the corresponding discovery event.  
> This patch is a band aid to the design flaws in {{DiscoveryNetworkConnector}}, and a refactoring of the connector should be considered.  In particular, there is a tight-coupling between {{DiscoveryNetworkConnector}} and {{SimpleDiscoveryAgent}} that is not evident through their interfaces.  For example, {{DiscoveryNetworkConnector}} assumes that the call to {{discoveryAgent.serviceFailed(...)}} will result in a call back to {{DiscoveryNetworkConnector.onServiceRemove(...)}}.  The call to {{onServiceRemove(...)}} is necessary to clean up the resources used by the bridge, but that requirement is not explicit, and therefore easily missed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (AMQ-4160) DiscoveryNetworkConnector can lose track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges

Posted by "Timothy Bish (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AMQ-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timothy Bish resolved AMQ-4160.
-------------------------------

    Resolution: Fixed

I fixed up the discovery connector code to properly remove the uri and disabled one of the tests as it was not able to properly test the scenario any longer. 
                
> DiscoveryNetworkConnector can lose track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AMQ-4160
>                 URL: https://issues.apache.org/jira/browse/AMQ-4160
>             Project: ActiveMQ
>          Issue Type: Bug
>            Reporter: Stirling Chow
>            Assignee: Timothy Bish
>            Priority: Critical
>             Fix For: 5.8.0
>
>         Attachments: AMQ4160.patch, AMQ4160Test.java
>
>
> Symptom
> =======
> {{DiscoveryNetworkConnector}} is not thread-safe, and as a result, race conditions allow the {{bridges}} data structure to become corrupt and not accurately represent the bridges that exist.
> Because {{bridges}} is used to control whether a discovery event will result in a bridge creation attempt, if it is not accurate, two results are possible:
> # A discovery event will be ignored even though its corresponding bridge is not active; as a result, the bridge will never be established
> # A discovery event will be processed even though its corresponding bridge is active; as a result, the bridge creation will fail (because of duplicate connections), and be indefinitely retried, generating many WARN/ERROR log messages
> Cause
> =====
> {{DiscoveryNetworkConnector}} updates a {{bridges}} hashmap whenever a bridge is created or removed:
> {code:title=DiscoveryNetworkConnector.java}
> public void onServiceAdd(DiscoveryEvent event) {
> ...
>         // Should we try to connect to that URI?
>         synchronized (bridges) {
>             if( bridges.containsKey(uri) ) {
>                 if (LOG.isDebugEnabled()) {
>                     LOG.debug("Discovery agent generated a duplicate onServiceAdd event for: "+uri );
>                 }
>                 return;
>             }
>         }
> ...
>         NetworkBridge bridge = createBridge(localTransport, remoteTransport, event);
>         try {
>             bridge.start();
>             synchronized (bridges) {
>                 bridges.put(uri, bridge);
>             }
> ...
> }
> public void onServiceRemove(DiscoveryEvent event) {
>     String url = event.getServiceName();
>     if (url != null) {
>         URI uri;
>         try {
>             uri = new URI(url);
>         } catch (URISyntaxException e) {
>             LOG.warn("Could not connect to remote URI: " + url + " due to bad URI syntax: " + e, e);
>             return;
>         }
>         synchronized (bridges) {
>             bridges.remove(uri);
>         }
>     }
> }
> {code}
> There are a number of problems:
> # The check-and-set for adding an entry {{bridges}} is not atomic.  As a result, concurrent attempts to add a bridge to the same URL can be allowed to proceed to bridge creation.  Only one of the bridges will be established (the other will fail when it attempts to add duplicate connections); however, the failure of the second bridge will result in a call to {{onServiceRemove(...)}} that will remove the single/shared {{bridges}} entry.
> # The bridge is started before an entry is added to {{bridges}}.  Since bridges are multi-threaded, it is possible that an exception will be handled by a different thread at some time between {{bridge.start()}} and {{bridges.put(uri, bridge}}.  In processing this exception, the thread will call {{onServiceRemove(...}}, which will remove the (non-existent) {{bridges}} entry.  Subseqently, {{bridges.put(uri, bridge)}} is executed, and adds the entry to {{bridges}} even though it is now stale and represents a failed bridge.  Subsequent attempts to restore the bridge will be ignored, and no active bridge will be created.
> The lack of thread-safety in {{DiscoveryNetworkConnector}} is exacerbated by AMQ-4159, which can result in many concurrent attempts to establish a bridge to the same URL.  AMQ-4159 also described how multiple calls can be made to {{onServiceRemove(...)}}, which can result in mal-behaviour similar to the second case described above (i.e., unexpected removal of a bridge that is active).
> Solution
> ========
> The attached patch attempts to restore some thread-safety to {{DiscoveryNetworkConnector}} by making the check-and-set atomic and adding the entry to {{bridges}} prior to starting the bridge.  An additional structure, {{activeEvents}}, keeps track of the event that represents the current attempt to establish a bridge to a given remote URL --- this is used to prevent multiple calls to {{onServiceRemove(...)}} from removing a {{bridges}} entry that was *not* added by the corresponding discovery event.  
> This patch is a band aid to the design flaws in {{DiscoveryNetworkConnector}}, and a refactoring of the connector should be considered.  In particular, there is a tight-coupling between {{DiscoveryNetworkConnector}} and {{SimpleDiscoveryAgent}} that is not evident through their interfaces.  For example, {{DiscoveryNetworkConnector}} assumes that the call to {{discoveryAgent.serviceFailed(...)}} will result in a call back to {{DiscoveryNetworkConnector.onServiceRemove(...)}}.  The call to {{onServiceRemove(...)}} is necessary to clean up the resources used by the bridge, but that requirement is not explicit, and therefore easily missed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Reopened] (AMQ-4160) DiscoveryNetworkConnector can lose track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges

Posted by "Stirling Chow (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AMQ-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stirling Chow reopened AMQ-4160:
--------------------------------

    Regression: Unit Test Broken

Reopening as I've found two problems with the patch as applied:

+            // Should we try to connect to that URI?
+            if (activeEvents.putIfAbsent(uri, event) != null) {
+                LOG.debug("Discovery agent generated a duplicate onServiceAdd event for: "+uri );
+            }

---> there should be a "return" statement after the LOG.debug

+            // Only remove bridge if this is the active discovery event for the URL.
+            if (activeEvents.remove(url, event)) {
+                synchronized (bridges) {
+                    bridges.remove(uri);
+                }

---> the activeEvents.remove(url, event)) should be activeEvents.remove(uri, event) (i.e., uri instead of url)

The existing AMQ4160Test.java will not pass with these updates since it demonstrates the original bug, which was based on concurrent attempts being allowed to occur.  The original patch was still allowing concurrent attempts.  With these updated changes, concurrent attempts are disallowed so AMQ4160Test.java will fail --- I'll have to provide an updated unit test to account for this behaviour.
                
> DiscoveryNetworkConnector can lose track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AMQ-4160
>                 URL: https://issues.apache.org/jira/browse/AMQ-4160
>             Project: ActiveMQ
>          Issue Type: Bug
>            Reporter: Stirling Chow
>            Assignee: Timothy Bish
>            Priority: Critical
>             Fix For: 5.8.0
>
>         Attachments: AMQ4160.patch, AMQ4160Test.java
>
>
> Symptom
> =======
> {{DiscoveryNetworkConnector}} is not thread-safe, and as a result, race conditions allow the {{bridges}} data structure to become corrupt and not accurately represent the bridges that exist.
> Because {{bridges}} is used to control whether a discovery event will result in a bridge creation attempt, if it is not accurate, two results are possible:
> # A discovery event will be ignored even though its corresponding bridge is not active; as a result, the bridge will never be established
> # A discovery event will be processed even though its corresponding bridge is active; as a result, the bridge creation will fail (because of duplicate connections), and be indefinitely retried, generating many WARN/ERROR log messages
> Cause
> =====
> {{DiscoveryNetworkConnector}} updates a {{bridges}} hashmap whenever a bridge is created or removed:
> {code:title=DiscoveryNetworkConnector.java}
> public void onServiceAdd(DiscoveryEvent event) {
> ...
>         // Should we try to connect to that URI?
>         synchronized (bridges) {
>             if( bridges.containsKey(uri) ) {
>                 if (LOG.isDebugEnabled()) {
>                     LOG.debug("Discovery agent generated a duplicate onServiceAdd event for: "+uri );
>                 }
>                 return;
>             }
>         }
> ...
>         NetworkBridge bridge = createBridge(localTransport, remoteTransport, event);
>         try {
>             bridge.start();
>             synchronized (bridges) {
>                 bridges.put(uri, bridge);
>             }
> ...
> }
> public void onServiceRemove(DiscoveryEvent event) {
>     String url = event.getServiceName();
>     if (url != null) {
>         URI uri;
>         try {
>             uri = new URI(url);
>         } catch (URISyntaxException e) {
>             LOG.warn("Could not connect to remote URI: " + url + " due to bad URI syntax: " + e, e);
>             return;
>         }
>         synchronized (bridges) {
>             bridges.remove(uri);
>         }
>     }
> }
> {code}
> There are a number of problems:
> # The check-and-set for adding an entry {{bridges}} is not atomic.  As a result, concurrent attempts to add a bridge to the same URL can be allowed to proceed to bridge creation.  Only one of the bridges will be established (the other will fail when it attempts to add duplicate connections); however, the failure of the second bridge will result in a call to {{onServiceRemove(...)}} that will remove the single/shared {{bridges}} entry.
> # The bridge is started before an entry is added to {{bridges}}.  Since bridges are multi-threaded, it is possible that an exception will be handled by a different thread at some time between {{bridge.start()}} and {{bridges.put(uri, bridge}}.  In processing this exception, the thread will call {{onServiceRemove(...}}, which will remove the (non-existent) {{bridges}} entry.  Subseqently, {{bridges.put(uri, bridge)}} is executed, and adds the entry to {{bridges}} even though it is now stale and represents a failed bridge.  Subsequent attempts to restore the bridge will be ignored, and no active bridge will be created.
> The lack of thread-safety in {{DiscoveryNetworkConnector}} is exacerbated by AMQ-4159, which can result in many concurrent attempts to establish a bridge to the same URL.  AMQ-4159 also described how multiple calls can be made to {{onServiceRemove(...)}}, which can result in mal-behaviour similar to the second case described above (i.e., unexpected removal of a bridge that is active).
> Solution
> ========
> The attached patch attempts to restore some thread-safety to {{DiscoveryNetworkConnector}} by making the check-and-set atomic and adding the entry to {{bridges}} prior to starting the bridge.  An additional structure, {{activeEvents}}, keeps track of the event that represents the current attempt to establish a bridge to a given remote URL --- this is used to prevent multiple calls to {{onServiceRemove(...)}} from removing a {{bridges}} entry that was *not* added by the corresponding discovery event.  
> This patch is a band aid to the design flaws in {{DiscoveryNetworkConnector}}, and a refactoring of the connector should be considered.  In particular, there is a tight-coupling between {{DiscoveryNetworkConnector}} and {{SimpleDiscoveryAgent}} that is not evident through their interfaces.  For example, {{DiscoveryNetworkConnector}} assumes that the call to {{discoveryAgent.serviceFailed(...)}} will result in a call back to {{DiscoveryNetworkConnector.onServiceRemove(...)}}.  The call to {{onServiceRemove(...)}} is necessary to clean up the resources used by the bridge, but that requirement is not explicit, and therefore easily missed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (AMQ-4160) DiscoveryNetworkConnector can lose track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges

Posted by "Stirling Chow (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AMQ-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506927#comment-13506927 ] 

Stirling Chow commented on AMQ-4160:
------------------------------------

Great to hear, Timothy.  Thanks for working through this with me.
                
> DiscoveryNetworkConnector can lose track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AMQ-4160
>                 URL: https://issues.apache.org/jira/browse/AMQ-4160
>             Project: ActiveMQ
>          Issue Type: Bug
>            Reporter: Stirling Chow
>            Assignee: Timothy Bish
>            Priority: Critical
>             Fix For: 5.8.0
>
>         Attachments: AMQ4160_2.patch, AMQ4160.patch, AMQ4160Test.java
>
>
> Symptom
> =======
> {{DiscoveryNetworkConnector}} is not thread-safe, and as a result, race conditions allow the {{bridges}} data structure to become corrupt and not accurately represent the bridges that exist.
> Because {{bridges}} is used to control whether a discovery event will result in a bridge creation attempt, if it is not accurate, two results are possible:
> # A discovery event will be ignored even though its corresponding bridge is not active; as a result, the bridge will never be established
> # A discovery event will be processed even though its corresponding bridge is active; as a result, the bridge creation will fail (because of duplicate connections), and be indefinitely retried, generating many WARN/ERROR log messages
> Cause
> =====
> {{DiscoveryNetworkConnector}} updates a {{bridges}} hashmap whenever a bridge is created or removed:
> {code:title=DiscoveryNetworkConnector.java}
> public void onServiceAdd(DiscoveryEvent event) {
> ...
>         // Should we try to connect to that URI?
>         synchronized (bridges) {
>             if( bridges.containsKey(uri) ) {
>                 if (LOG.isDebugEnabled()) {
>                     LOG.debug("Discovery agent generated a duplicate onServiceAdd event for: "+uri );
>                 }
>                 return;
>             }
>         }
> ...
>         NetworkBridge bridge = createBridge(localTransport, remoteTransport, event);
>         try {
>             bridge.start();
>             synchronized (bridges) {
>                 bridges.put(uri, bridge);
>             }
> ...
> }
> public void onServiceRemove(DiscoveryEvent event) {
>     String url = event.getServiceName();
>     if (url != null) {
>         URI uri;
>         try {
>             uri = new URI(url);
>         } catch (URISyntaxException e) {
>             LOG.warn("Could not connect to remote URI: " + url + " due to bad URI syntax: " + e, e);
>             return;
>         }
>         synchronized (bridges) {
>             bridges.remove(uri);
>         }
>     }
> }
> {code}
> There are a number of problems:
> # The check-and-set for adding an entry {{bridges}} is not atomic.  As a result, concurrent attempts to add a bridge to the same URL can be allowed to proceed to bridge creation.  Only one of the bridges will be established (the other will fail when it attempts to add duplicate connections); however, the failure of the second bridge will result in a call to {{onServiceRemove(...)}} that will remove the single/shared {{bridges}} entry.
> # The bridge is started before an entry is added to {{bridges}}.  Since bridges are multi-threaded, it is possible that an exception will be handled by a different thread at some time between {{bridge.start()}} and {{bridges.put(uri, bridge}}.  In processing this exception, the thread will call {{onServiceRemove(...}}, which will remove the (non-existent) {{bridges}} entry.  Subseqently, {{bridges.put(uri, bridge)}} is executed, and adds the entry to {{bridges}} even though it is now stale and represents a failed bridge.  Subsequent attempts to restore the bridge will be ignored, and no active bridge will be created.
> The lack of thread-safety in {{DiscoveryNetworkConnector}} is exacerbated by AMQ-4159, which can result in many concurrent attempts to establish a bridge to the same URL.  AMQ-4159 also described how multiple calls can be made to {{onServiceRemove(...)}}, which can result in mal-behaviour similar to the second case described above (i.e., unexpected removal of a bridge that is active).
> Solution
> ========
> The attached patch attempts to restore some thread-safety to {{DiscoveryNetworkConnector}} by making the check-and-set atomic and adding the entry to {{bridges}} prior to starting the bridge.  An additional structure, {{activeEvents}}, keeps track of the event that represents the current attempt to establish a bridge to a given remote URL --- this is used to prevent multiple calls to {{onServiceRemove(...)}} from removing a {{bridges}} entry that was *not* added by the corresponding discovery event.  
> This patch is a band aid to the design flaws in {{DiscoveryNetworkConnector}}, and a refactoring of the connector should be considered.  In particular, there is a tight-coupling between {{DiscoveryNetworkConnector}} and {{SimpleDiscoveryAgent}} that is not evident through their interfaces.  For example, {{DiscoveryNetworkConnector}} assumes that the call to {{discoveryAgent.serviceFailed(...)}} will result in a call back to {{DiscoveryNetworkConnector.onServiceRemove(...)}}.  The call to {{onServiceRemove(...)}} is necessary to clean up the resources used by the bridge, but that requirement is not explicit, and therefore easily missed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (AMQ-4160) DiscoveryNetworkConnector can lose track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges

Posted by "Timothy Bish (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AMQ-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timothy Bish resolved AMQ-4160.
-------------------------------

    Resolution: Fixed

Needed to confirm that this didn't break any tests and there's currently some instability in some of the network tests.  Looks good, nice work.
                
> DiscoveryNetworkConnector can lose track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AMQ-4160
>                 URL: https://issues.apache.org/jira/browse/AMQ-4160
>             Project: ActiveMQ
>          Issue Type: Bug
>            Reporter: Stirling Chow
>            Assignee: Timothy Bish
>            Priority: Critical
>             Fix For: 5.8.0
>
>         Attachments: AMQ4160.patch, AMQ4160Test.java
>
>
> Symptom
> =======
> {{DiscoveryNetworkConnector}} is not thread-safe, and as a result, race conditions allow the {{bridges}} data structure to become corrupt and not accurately represent the bridges that exist.
> Because {{bridges}} is used to control whether a discovery event will result in a bridge creation attempt, if it is not accurate, two results are possible:
> # A discovery event will be ignored even though its corresponding bridge is not active; as a result, the bridge will never be established
> # A discovery event will be processed even though its corresponding bridge is active; as a result, the bridge creation will fail (because of duplicate connections), and be indefinitely retried, generating many WARN/ERROR log messages
> Cause
> =====
> {{DiscoveryNetworkConnector}} updates a {{bridges}} hashmap whenever a bridge is created or removed:
> {code:title=DiscoveryNetworkConnector.java}
> public void onServiceAdd(DiscoveryEvent event) {
> ...
>         // Should we try to connect to that URI?
>         synchronized (bridges) {
>             if( bridges.containsKey(uri) ) {
>                 if (LOG.isDebugEnabled()) {
>                     LOG.debug("Discovery agent generated a duplicate onServiceAdd event for: "+uri );
>                 }
>                 return;
>             }
>         }
> ...
>         NetworkBridge bridge = createBridge(localTransport, remoteTransport, event);
>         try {
>             bridge.start();
>             synchronized (bridges) {
>                 bridges.put(uri, bridge);
>             }
> ...
> }
> public void onServiceRemove(DiscoveryEvent event) {
>     String url = event.getServiceName();
>     if (url != null) {
>         URI uri;
>         try {
>             uri = new URI(url);
>         } catch (URISyntaxException e) {
>             LOG.warn("Could not connect to remote URI: " + url + " due to bad URI syntax: " + e, e);
>             return;
>         }
>         synchronized (bridges) {
>             bridges.remove(uri);
>         }
>     }
> }
> {code}
> There are a number of problems:
> # The check-and-set for adding an entry {{bridges}} is not atomic.  As a result, concurrent attempts to add a bridge to the same URL can be allowed to proceed to bridge creation.  Only one of the bridges will be established (the other will fail when it attempts to add duplicate connections); however, the failure of the second bridge will result in a call to {{onServiceRemove(...)}} that will remove the single/shared {{bridges}} entry.
> # The bridge is started before an entry is added to {{bridges}}.  Since bridges are multi-threaded, it is possible that an exception will be handled by a different thread at some time between {{bridge.start()}} and {{bridges.put(uri, bridge}}.  In processing this exception, the thread will call {{onServiceRemove(...}}, which will remove the (non-existent) {{bridges}} entry.  Subseqently, {{bridges.put(uri, bridge)}} is executed, and adds the entry to {{bridges}} even though it is now stale and represents a failed bridge.  Subsequent attempts to restore the bridge will be ignored, and no active bridge will be created.
> The lack of thread-safety in {{DiscoveryNetworkConnector}} is exacerbated by AMQ-4159, which can result in many concurrent attempts to establish a bridge to the same URL.  AMQ-4159 also described how multiple calls can be made to {{onServiceRemove(...)}}, which can result in mal-behaviour similar to the second case described above (i.e., unexpected removal of a bridge that is active).
> Solution
> ========
> The attached patch attempts to restore some thread-safety to {{DiscoveryNetworkConnector}} by making the check-and-set atomic and adding the entry to {{bridges}} prior to starting the bridge.  An additional structure, {{activeEvents}}, keeps track of the event that represents the current attempt to establish a bridge to a given remote URL --- this is used to prevent multiple calls to {{onServiceRemove(...)}} from removing a {{bridges}} entry that was *not* added by the corresponding discovery event.  
> This patch is a band aid to the design flaws in {{DiscoveryNetworkConnector}}, and a refactoring of the connector should be considered.  In particular, there is a tight-coupling between {{DiscoveryNetworkConnector}} and {{SimpleDiscoveryAgent}} that is not evident through their interfaces.  For example, {{DiscoveryNetworkConnector}} assumes that the call to {{discoveryAgent.serviceFailed(...)}} will result in a call back to {{DiscoveryNetworkConnector.onServiceRemove(...)}}.  The call to {{onServiceRemove(...)}} is necessary to clean up the resources used by the bridge, but that requirement is not explicit, and therefore easily missed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (AMQ-4160) DiscoveryNetworkConnector can loss track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges

Posted by "Timothy Bish (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AMQ-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timothy Bish reassigned AMQ-4160:
---------------------------------

    Assignee: Timothy Bish
    
> DiscoveryNetworkConnector can loss track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AMQ-4160
>                 URL: https://issues.apache.org/jira/browse/AMQ-4160
>             Project: ActiveMQ
>          Issue Type: Bug
>            Reporter: Stirling Chow
>            Assignee: Timothy Bish
>            Priority: Critical
>             Fix For: 5.8.0
>
>         Attachments: AMQ4160.patch, AMQ4160Test.java
>
>
> Symptom
> =======
> {{DiscoveryNetworkConnector}} is not thread-safe, and as a result, race conditions allow the {{bridges}} data structure to become corrupt and not accurately represent the bridges that exist.
> Because {{bridges}} is used to control whether a discovery event will result in a bridge creation attempt, if it is not accurate, two results are possible:
> # A discovery event will be ignored even though its corresponding bridge is not active; as a result, the bridge will never be established
> # A discovery event will be processed even though its corresponding bridge is active; as a result, the bridge creation will fail (because of duplicate connections), and be indefinitely retried, generating many WARN/ERROR log messages
> Cause
> =====
> {{DiscoveryNetworkConnector}} updates a {{bridges}} hashmap whenever a bridge is created or removed:
> {code:title=DiscoveryNetworkConnector.java}
> public void onServiceAdd(DiscoveryEvent event) {
> ...
>         // Should we try to connect to that URI?
>         synchronized (bridges) {
>             if( bridges.containsKey(uri) ) {
>                 if (LOG.isDebugEnabled()) {
>                     LOG.debug("Discovery agent generated a duplicate onServiceAdd event for: "+uri );
>                 }
>                 return;
>             }
>         }
> ...
>         NetworkBridge bridge = createBridge(localTransport, remoteTransport, event);
>         try {
>             bridge.start();
>             synchronized (bridges) {
>                 bridges.put(uri, bridge);
>             }
> ...
> }
> public void onServiceRemove(DiscoveryEvent event) {
>     String url = event.getServiceName();
>     if (url != null) {
>         URI uri;
>         try {
>             uri = new URI(url);
>         } catch (URISyntaxException e) {
>             LOG.warn("Could not connect to remote URI: " + url + " due to bad URI syntax: " + e, e);
>             return;
>         }
>         synchronized (bridges) {
>             bridges.remove(uri);
>         }
>     }
> }
> {code}
> There are a number of problems:
> # The check-and-set for adding an entry {{bridges}} is not atomic.  As a result, concurrent attempts to add a bridge to the same URL can be allowed to proceed to bridge creation.  Only one of the bridges will be established (the other will fail when it attempts to add duplicate connections); however, the failure of the second bridge will result in a call to {{onServiceRemove(...)}} that will remove the single/shared {{bridges}} entry.
> # The bridge is started before an entry is added to {{bridges}}.  Since bridges are multi-threaded, it is possible that an exception will be handled by a different thread at some time between {{bridge.start()}} and {{bridges.put(uri, bridge}}.  In processing this exception, the thread will call {{onServiceRemove(...}}, which will remove the (non-existent) {{bridges}} entry.  Subseqently, {{bridges.put(uri, bridge)}} is executed, and adds the entry to {{bridges}} even though it is now stale and represents a failed bridge.  Subsequent attempts to restore the bridge will be ignored, and no active bridge will be created.
> The lack of thread-safety in {{DiscoveryNetworkConnector}} is exacerbated by AMQ-4159, which can result in many concurrent attempts to establish a bridge to the same URL.  AMQ-4159 also described how multiple calls can be made to {{onServiceRemove(...)}}, which can result in mal-behaviour similar to the second case described above (i.e., unexpected removal of a bridge that is active).
> Solution
> ========
> The attached patch attempts to restore some thread-safety to {{DiscoveryNetworkConnector}} by making the check-and-set atomic and adding the entry to {{bridges}} prior to starting the bridge.  An additional structure, {{activeEvents}}, keeps track of the event that represents the current attempt to establish a bridge to a given remote URL --- this is used to prevent multiple calls to {{onServiceRemove(...)}} from removing a {{bridges}} entry that was *not* added by the corresponding discovery event.  
> This patch is a band aid to the design flaws in {{DiscoveryNetworkConnector}}, and a refactoring of the connector should be considered.  In particular, there is a tight-coupling between {{DiscoveryNetworkConnector}} and {{SimpleDiscoveryAgent}} that is not evident through their interfaces.  For example, {{DiscoveryNetworkConnector}} assumes that the call to {{discoveryAgent.serviceFailed(...)}} will result in a call back to {{DiscoveryNetworkConnector.onServiceRemove(...)}}.  The call to {{onServiceRemove(...)}} is necessary to clean up the resources used by the bridge, but that requirement is not explicit, and therefore easily missed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (AMQ-4160) DiscoveryNetworkConnector can loss track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges

Posted by "Stirling Chow (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AMQ-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13490726#comment-13490726 ] 

Stirling Chow commented on AMQ-4160:
------------------------------------

Copying this comment through from AMQ-4159 since it is salient to the fix described in AMQ-4160:

Thanks for the fast turnaround, Timothy. I linked AMQ-4159 and AMQ-4160 together as they are related (at least the symptoms were observed during the same test that we were running). Now that I think about it, the fix for AMQ-4160 relies on AMQ-4159 (i.e., one-to-one relationship between DiscoveryEvent and a bridge creation attempt). It would be great if AMQ-4160 could also be fixed for AMQ 5.8.0.

                
> DiscoveryNetworkConnector can loss track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AMQ-4160
>                 URL: https://issues.apache.org/jira/browse/AMQ-4160
>             Project: ActiveMQ
>          Issue Type: Bug
>            Reporter: Stirling Chow
>            Priority: Critical
>             Fix For: 5.8.0
>
>         Attachments: AMQ4160.patch, AMQ4160Test.java
>
>
> Symptom
> =======
> {{DiscoveryNetworkConnector}} is not thread-safe, and as a result, race conditions allow the {{bridges}} data structure to become corrupt and not accurately represent the bridges that exist.
> Because {{bridges}} is used to control whether a discovery event will result in a bridge creation attempt, if it is not accurate, two results are possible:
> # A discovery event will be ignored even though its corresponding bridge is not active; as a result, the bridge will never be established
> # A discovery event will be processed even though its corresponding bridge is active; as a result, the bridge creation will fail (because of duplicate connections), and be indefinitely retried, generating many WARN/ERROR log messages
> Cause
> =====
> {{DiscoveryNetworkConnector}} updates a {{bridges}} hashmap whenever a bridge is created or removed:
> {code:title=DiscoveryNetworkConnector.java}
> public void onServiceAdd(DiscoveryEvent event) {
> ...
>         // Should we try to connect to that URI?
>         synchronized (bridges) {
>             if( bridges.containsKey(uri) ) {
>                 if (LOG.isDebugEnabled()) {
>                     LOG.debug("Discovery agent generated a duplicate onServiceAdd event for: "+uri );
>                 }
>                 return;
>             }
>         }
> ...
>         NetworkBridge bridge = createBridge(localTransport, remoteTransport, event);
>         try {
>             bridge.start();
>             synchronized (bridges) {
>                 bridges.put(uri, bridge);
>             }
> ...
> }
> public void onServiceRemove(DiscoveryEvent event) {
>     String url = event.getServiceName();
>     if (url != null) {
>         URI uri;
>         try {
>             uri = new URI(url);
>         } catch (URISyntaxException e) {
>             LOG.warn("Could not connect to remote URI: " + url + " due to bad URI syntax: " + e, e);
>             return;
>         }
>         synchronized (bridges) {
>             bridges.remove(uri);
>         }
>     }
> }
> {code}
> There are a number of problems:
> # The check-and-set for adding an entry {{bridges}} is not atomic.  As a result, concurrent attempts to add a bridge to the same URL can be allowed to proceed to bridge creation.  Only one of the bridges will be established (the other will fail when it attempts to add duplicate connections); however, the failure of the second bridge will result in a call to {{onServiceRemove(...)}} that will remove the single/shared {{bridges}} entry.
> # The bridge is started before an entry is added to {{bridges}}.  Since bridges are multi-threaded, it is possible that an exception will be handled by a different thread at some time between {{bridge.start()}} and {{bridges.put(uri, bridge}}.  In processing this exception, the thread will call {{onServiceRemove(...}}, which will remove the (non-existent) {{bridges}} entry.  Subseqently, {{bridges.put(uri, bridge)}} is executed, and adds the entry to {{bridges}} even though it is now stale and represents a failed bridge.  Subsequent attempts to restore the bridge will be ignored, and no active bridge will be created.
> The lack of thread-safety in {{DiscoveryNetworkConnector}} is exacerbated by AMQ-4159, which can result in many concurrent attempts to establish a bridge to the same URL.  AMQ-4159 also described how multiple calls can be made to {{onServiceRemove(...)}}, which can result in mal-behaviour similar to the second case described above (i.e., unexpected removal of a bridge that is active).
> Solution
> ========
> The attached patch attempts to restore some thread-safety to {{DiscoveryNetworkConnector}} by making the check-and-set atomic and adding the entry to {{bridges}} prior to starting the bridge.  An additional structure, {{activeEvents}}, keeps track of the event that represents the current attempt to establish a bridge to a given remote URL --- this is used to prevent multiple calls to {{onServiceRemove(...)}} from removing a {{bridges}} entry that was *not* added by the corresponding discovery event.  
> This patch is a band aid to the design flaws in {{DiscoveryNetworkConnector}}, and a refactoring of the connector should be considered.  In particular, there is a tight-coupling between {{DiscoveryNetworkConnector}} and {{SimpleDiscoveryAgent}} that is not evident through their interfaces.  For example, {{DiscoveryNetworkConnector}} assumes that the call to {{discoveryAgent.serviceFailed(...)}} will result in a call back to {{DiscoveryNetworkConnector.onServiceRemove(...)}}.  The call to {{onServiceRemove(...)}} is necessary to clean up the resources used by the bridge, but that requirement is not explicit, and therefore easily missed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (AMQ-4160) DiscoveryNetworkConnector can loss track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges

Posted by "Stirling Chow (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AMQ-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stirling Chow updated AMQ-4160:
-------------------------------

    Attachment: AMQ4160.patch

Patch to address the reported issue.
                
> DiscoveryNetworkConnector can loss track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AMQ-4160
>                 URL: https://issues.apache.org/jira/browse/AMQ-4160
>             Project: ActiveMQ
>          Issue Type: Bug
>            Reporter: Stirling Chow
>            Priority: Critical
>             Fix For: 5.8.0
>
>         Attachments: AMQ4160.patch, AMQ4160Test.java
>
>
> Symptom
> =======
> {{DiscoveryNetworkConnector}} is not thread-safe, and as a result, race conditions allow the {{bridges}} data structure to become corrupt and not accurately represent the bridges that exist.
> Because {{bridges}} is used to control whether a discovery event will result in a bridge creation attempt, if it is not accurate, two results are possible:
> # A discovery event will be ignored even though its corresponding bridge is not active; as a result, the bridge will never be established
> # A discovery event will be processed even though its corresponding bridge is active; as a result, the bridge creation will fail (because of duplicate connections), and be indefinitely retried, generating many WARN/ERROR log messages
> Cause
> =====
> {{DiscoveryNetworkConnector}} updates a {{bridges}} hashmap whenever a bridge is created or removed:
> {code:title=DiscoveryNetworkConnector.java}
> public void onServiceAdd(DiscoveryEvent event) {
> ...
>         // Should we try to connect to that URI?
>         synchronized (bridges) {
>             if( bridges.containsKey(uri) ) {
>                 if (LOG.isDebugEnabled()) {
>                     LOG.debug("Discovery agent generated a duplicate onServiceAdd event for: "+uri );
>                 }
>                 return;
>             }
>         }
> ...
>         NetworkBridge bridge = createBridge(localTransport, remoteTransport, event);
>         try {
>             bridge.start();
>             synchronized (bridges) {
>                 bridges.put(uri, bridge);
>             }
> ...
> }
> public void onServiceRemove(DiscoveryEvent event) {
>     String url = event.getServiceName();
>     if (url != null) {
>         URI uri;
>         try {
>             uri = new URI(url);
>         } catch (URISyntaxException e) {
>             LOG.warn("Could not connect to remote URI: " + url + " due to bad URI syntax: " + e, e);
>             return;
>         }
>         synchronized (bridges) {
>             bridges.remove(uri);
>         }
>     }
> }
> {code}
> There are a number of problems:
> # The check-and-set for adding an entry {{bridges}} is not atomic.  As a result, concurrent attempts to add a bridge to the same URL can be allowed to proceed to bridge creation.  Only one of the bridges will be established (the other will fail when it attempts to add duplicate connections); however, the failure of the second bridge will result in a call to {{onServiceRemove(...)}} that will remove the single/shared {{bridges}} entry.
> # The bridge is started before an entry is added to {{bridges}}.  Since bridges are multi-threaded, it is possible that an exception will be handled by a different thread at some time between {{bridge.start()}} and {{bridges.put(uri, bridge}}.  In processing this exception, the thread will call {{onServiceRemove(...}}, which will remove the (non-existent) {{bridges}} entry.  Subseqently, {{bridges.put(uri, bridge)}} is executed, and adds the entry to {{bridges}} even though it is now stale and represents a failed bridge.  Subsequent attempts to restore the bridge will be ignored, and no active bridge will be created.
> The lack of thread-safety in {{DiscoveryNetworkConnector}} is exacerbated by AMQ-4159, which can result in many concurrent attempts to establish a bridge to the same URL.  AMQ-4159 also described how multiple calls can be made to {{onServiceRemove(...)}}, which can result in mal-behaviour similar to the second case described above (i.e., unexpected removal of a bridge that is active).
> Solution
> ========
> The attached patch attempts to restore some thread-safety to {{DiscoveryNetworkConnector}} by making the check-and-set atomic and adding the entry to {{bridges}} prior to starting the bridge.  An additional structure, {{activeEvents}}, keeps track of the event that represents the current attempt to establish a bridge to a given remote URL --- this is used to prevent multiple calls to {{onServiceRemove(...)}} from removing a {{bridges}} entry that was *not* added by the corresponding discovery event.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (AMQ-4160) DiscoveryNetworkConnector can loss track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges

Posted by "Stirling Chow (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AMQ-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stirling Chow updated AMQ-4160:
-------------------------------

    Description: 
Symptom
=======
{{DiscoveryNetworkConnector}} is not thread-safe, and as a result, race conditions allow the {{bridges}} data structure to become corrupt and not accurately represent the bridges that exist.

Because {{bridges}} is used to control whether a discovery event will result in a bridge creation attempt, if it is not accurate, two results are possible:

# A discovery event will be ignored even though its corresponding bridge is not active; as a result, the bridge will never be established
# A discovery event will be processed even though its corresponding bridge is active; as a result, the bridge creation will fail (because of duplicate connections), and be indefinitely retried, generating many WARN/ERROR log messages

Cause
=====
{{DiscoveryNetworkConnector}} updates a {{bridges}} hashmap whenever a bridge is created or removed:

{code:title=DiscoveryNetworkConnector.java}
public void onServiceAdd(DiscoveryEvent event) {
...
        // Should we try to connect to that URI?
        synchronized (bridges) {
            if( bridges.containsKey(uri) ) {
                if (LOG.isDebugEnabled()) {
                    LOG.debug("Discovery agent generated a duplicate onServiceAdd event for: "+uri );
                }
                return;
            }
        }
...
        NetworkBridge bridge = createBridge(localTransport, remoteTransport, event);
        try {
            bridge.start();
            synchronized (bridges) {
                bridges.put(uri, bridge);
            }
...
}


public void onServiceRemove(DiscoveryEvent event) {
    String url = event.getServiceName();
    if (url != null) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            LOG.warn("Could not connect to remote URI: " + url + " due to bad URI syntax: " + e, e);
            return;
        }

        synchronized (bridges) {
            bridges.remove(uri);
        }
    }
}
{code}

There are a number of problems:

# The check-and-set for adding an entry {{bridges}} is not atomic.  As a result, concurrent attempts to add a bridge to the same URL can be allowed to proceed to bridge creation.  Only one of the bridges will be established (the other will fail when it attempts to add duplicate connections); however, the failure of the second bridge will result in a call to {{onServiceRemove(...)}} that will remove the single/shared {{bridges}} entry.
# The bridge is started before an entry is added to {{bridges}}.  Since bridges are multi-threaded, it is possible that an exception will be handled by a different thread at some time between {{bridge.start()}} and {{bridges.put(uri, bridge}}.  In processing this exception, the thread will call {{onServiceRemove(...}}, which will remove the (non-existent) {{bridges}} entry.  Subseqently, {{bridges.put(uri, bridge)}} is executed, and adds the entry to {{bridges}} even though it is now stale and represents a failed bridge.  Subsequent attempts to restore the bridge will be ignored, and no active bridge will be created.

The lack of thread-safety in {{DiscoveryNetworkConnector}} is exacerbated by AMQ-4159, which can result in many concurrent attempts to establish a bridge to the same URL.  AMQ-4159 also described how multiple calls can be made to {{onServiceRemove(...)}}, which can result in mal-behaviour similar to the second case described above (i.e., unexpected removal of a bridge that is active).

Solution
========
The attached patch attempts to restore some thread-safety to {{DiscoveryNetworkConnector}} by making the check-and-set atomic and adding the entry to {{bridges}} prior to starting the bridge.  An additional structure, {{activeEvents}}, keeps track of the event that represents the current attempt to establish a bridge to a given remote URL --- this is used to prevent multiple calls to {{onServiceRemove(...)}} from removing a {{bridges}} entry that was *not* added by the corresponding discovery event.  

This patch is a band aid to the design flaws in {{DiscoveryNetworkConnector}}, and a refactoring of the connector should be considered.  In particular, there is a tight-coupling between {{DiscoveryNetworkConnector}} and {{SimpleDiscoveryAgent}} that is not evident through their interfaces.  For example, {{DiscoveryNetworkConnector}} assumes that the call to {{discoveryAgent.serviceFailed(...)}} will result in a call back to {{DiscoveryNetworkConnector.onServiceRemove(...)}}.  The call to {{onServiceRemove(...)}} is necessary to clean up the resources used by the bridge, but that requirement is not explicit, and therefore easily missed.

  was:
Symptom
=======
{{DiscoveryNetworkConnector}} is not thread-safe, and as a result, race conditions allow the {{bridges}} data structure to become corrupt and not accurately represent the bridges that exist.

Because {{bridges}} is used to control whether a discovery event will result in a bridge creation attempt, if it is not accurate, two results are possible:

# A discovery event will be ignored even though its corresponding bridge is not active; as a result, the bridge will never be established
# A discovery event will be processed even though its corresponding bridge is active; as a result, the bridge creation will fail (because of duplicate connections), and be indefinitely retried, generating many WARN/ERROR log messages

Cause
=====
{{DiscoveryNetworkConnector}} updates a {{bridges}} hashmap whenever a bridge is created or removed:

{code:title=DiscoveryNetworkConnector.java}
public void onServiceAdd(DiscoveryEvent event) {
...
        // Should we try to connect to that URI?
        synchronized (bridges) {
            if( bridges.containsKey(uri) ) {
                if (LOG.isDebugEnabled()) {
                    LOG.debug("Discovery agent generated a duplicate onServiceAdd event for: "+uri );
                }
                return;
            }
        }
...
        NetworkBridge bridge = createBridge(localTransport, remoteTransport, event);
        try {
            bridge.start();
            synchronized (bridges) {
                bridges.put(uri, bridge);
            }
...
}


public void onServiceRemove(DiscoveryEvent event) {
    String url = event.getServiceName();
    if (url != null) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            LOG.warn("Could not connect to remote URI: " + url + " due to bad URI syntax: " + e, e);
            return;
        }

        synchronized (bridges) {
            bridges.remove(uri);
        }
    }
}
{code}

There are a number of problems:

# The check-and-set for adding an entry {{bridges}} is not atomic.  As a result, concurrent attempts to add a bridge to the same URL can be allowed to proceed to bridge creation.  Only one of the bridges will be established (the other will fail when it attempts to add duplicate connections); however, the failure of the second bridge will result in a call to {{onServiceRemove(...)}} that will remove the single/shared {{bridges}} entry.
# The bridge is started before an entry is added to {{bridges}}.  Since bridges are multi-threaded, it is possible that an exception will be handled by a different thread at some time between {{bridge.start()}} and {{bridges.put(uri, bridge}}.  In processing this exception, the thread will call {{onServiceRemove(...}}, which will remove the (non-existent) {{bridges}} entry.  Subseqently, {{bridges.put(uri, bridge)}} is executed, and adds the entry to {{bridges}} even though it is now stale and represents a failed bridge.  Subsequent attempts to restore the bridge will be ignored, and no active bridge will be created.

The lack of thread-safety in {{DiscoveryNetworkConnector}} is exacerbated by AMQ-4159, which can result in many concurrent attempts to establish a bridge to the same URL.  AMQ-4159 also described how multiple calls can be made to {{onServiceRemove(...)}}, which can result in mal-behaviour similar to the second case described above (i.e., unexpected removal of a bridge that is active).

Solution
========
The attached patch attempts to restore some thread-safety to {{DiscoveryNetworkConnector}} by making the check-and-set atomic and adding the entry to {{bridges}} prior to starting the bridge.  An additional structure, {{activeEvents}}, keeps track of the event that represents the current attempt to establish a bridge to a given remote URL --- this is used to prevent multiple calls to {{onServiceRemove(...)}} from removing a {{bridges}} entry that was *not* added by the corresponding discovery event.  

    
> DiscoveryNetworkConnector can loss track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AMQ-4160
>                 URL: https://issues.apache.org/jira/browse/AMQ-4160
>             Project: ActiveMQ
>          Issue Type: Bug
>            Reporter: Stirling Chow
>            Priority: Critical
>             Fix For: 5.8.0
>
>         Attachments: AMQ4160.patch, AMQ4160Test.java
>
>
> Symptom
> =======
> {{DiscoveryNetworkConnector}} is not thread-safe, and as a result, race conditions allow the {{bridges}} data structure to become corrupt and not accurately represent the bridges that exist.
> Because {{bridges}} is used to control whether a discovery event will result in a bridge creation attempt, if it is not accurate, two results are possible:
> # A discovery event will be ignored even though its corresponding bridge is not active; as a result, the bridge will never be established
> # A discovery event will be processed even though its corresponding bridge is active; as a result, the bridge creation will fail (because of duplicate connections), and be indefinitely retried, generating many WARN/ERROR log messages
> Cause
> =====
> {{DiscoveryNetworkConnector}} updates a {{bridges}} hashmap whenever a bridge is created or removed:
> {code:title=DiscoveryNetworkConnector.java}
> public void onServiceAdd(DiscoveryEvent event) {
> ...
>         // Should we try to connect to that URI?
>         synchronized (bridges) {
>             if( bridges.containsKey(uri) ) {
>                 if (LOG.isDebugEnabled()) {
>                     LOG.debug("Discovery agent generated a duplicate onServiceAdd event for: "+uri );
>                 }
>                 return;
>             }
>         }
> ...
>         NetworkBridge bridge = createBridge(localTransport, remoteTransport, event);
>         try {
>             bridge.start();
>             synchronized (bridges) {
>                 bridges.put(uri, bridge);
>             }
> ...
> }
> public void onServiceRemove(DiscoveryEvent event) {
>     String url = event.getServiceName();
>     if (url != null) {
>         URI uri;
>         try {
>             uri = new URI(url);
>         } catch (URISyntaxException e) {
>             LOG.warn("Could not connect to remote URI: " + url + " due to bad URI syntax: " + e, e);
>             return;
>         }
>         synchronized (bridges) {
>             bridges.remove(uri);
>         }
>     }
> }
> {code}
> There are a number of problems:
> # The check-and-set for adding an entry {{bridges}} is not atomic.  As a result, concurrent attempts to add a bridge to the same URL can be allowed to proceed to bridge creation.  Only one of the bridges will be established (the other will fail when it attempts to add duplicate connections); however, the failure of the second bridge will result in a call to {{onServiceRemove(...)}} that will remove the single/shared {{bridges}} entry.
> # The bridge is started before an entry is added to {{bridges}}.  Since bridges are multi-threaded, it is possible that an exception will be handled by a different thread at some time between {{bridge.start()}} and {{bridges.put(uri, bridge}}.  In processing this exception, the thread will call {{onServiceRemove(...}}, which will remove the (non-existent) {{bridges}} entry.  Subseqently, {{bridges.put(uri, bridge)}} is executed, and adds the entry to {{bridges}} even though it is now stale and represents a failed bridge.  Subsequent attempts to restore the bridge will be ignored, and no active bridge will be created.
> The lack of thread-safety in {{DiscoveryNetworkConnector}} is exacerbated by AMQ-4159, which can result in many concurrent attempts to establish a bridge to the same URL.  AMQ-4159 also described how multiple calls can be made to {{onServiceRemove(...)}}, which can result in mal-behaviour similar to the second case described above (i.e., unexpected removal of a bridge that is active).
> Solution
> ========
> The attached patch attempts to restore some thread-safety to {{DiscoveryNetworkConnector}} by making the check-and-set atomic and adding the entry to {{bridges}} prior to starting the bridge.  An additional structure, {{activeEvents}}, keeps track of the event that represents the current attempt to establish a bridge to a given remote URL --- this is used to prevent multiple calls to {{onServiceRemove(...)}} from removing a {{bridges}} entry that was *not* added by the corresponding discovery event.  
> This patch is a band aid to the design flaws in {{DiscoveryNetworkConnector}}, and a refactoring of the connector should be considered.  In particular, there is a tight-coupling between {{DiscoveryNetworkConnector}} and {{SimpleDiscoveryAgent}} that is not evident through their interfaces.  For example, {{DiscoveryNetworkConnector}} assumes that the call to {{discoveryAgent.serviceFailed(...)}} will result in a call back to {{DiscoveryNetworkConnector.onServiceRemove(...)}}.  The call to {{onServiceRemove(...)}} is necessary to clean up the resources used by the bridge, but that requirement is not explicit, and therefore easily missed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (AMQ-4160) DiscoveryNetworkConnector can lose track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges

Posted by "Timothy Bish (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AMQ-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496715#comment-13496715 ] 

Timothy Bish commented on AMQ-4160:
-----------------------------------

Applied the latest updates, lets see how the tests fair with these changes.
                
> DiscoveryNetworkConnector can lose track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AMQ-4160
>                 URL: https://issues.apache.org/jira/browse/AMQ-4160
>             Project: ActiveMQ
>          Issue Type: Bug
>            Reporter: Stirling Chow
>            Assignee: Timothy Bish
>            Priority: Critical
>             Fix For: 5.8.0
>
>         Attachments: AMQ4160_2.patch, AMQ4160.patch, AMQ4160Test.java
>
>
> Symptom
> =======
> {{DiscoveryNetworkConnector}} is not thread-safe, and as a result, race conditions allow the {{bridges}} data structure to become corrupt and not accurately represent the bridges that exist.
> Because {{bridges}} is used to control whether a discovery event will result in a bridge creation attempt, if it is not accurate, two results are possible:
> # A discovery event will be ignored even though its corresponding bridge is not active; as a result, the bridge will never be established
> # A discovery event will be processed even though its corresponding bridge is active; as a result, the bridge creation will fail (because of duplicate connections), and be indefinitely retried, generating many WARN/ERROR log messages
> Cause
> =====
> {{DiscoveryNetworkConnector}} updates a {{bridges}} hashmap whenever a bridge is created or removed:
> {code:title=DiscoveryNetworkConnector.java}
> public void onServiceAdd(DiscoveryEvent event) {
> ...
>         // Should we try to connect to that URI?
>         synchronized (bridges) {
>             if( bridges.containsKey(uri) ) {
>                 if (LOG.isDebugEnabled()) {
>                     LOG.debug("Discovery agent generated a duplicate onServiceAdd event for: "+uri );
>                 }
>                 return;
>             }
>         }
> ...
>         NetworkBridge bridge = createBridge(localTransport, remoteTransport, event);
>         try {
>             bridge.start();
>             synchronized (bridges) {
>                 bridges.put(uri, bridge);
>             }
> ...
> }
> public void onServiceRemove(DiscoveryEvent event) {
>     String url = event.getServiceName();
>     if (url != null) {
>         URI uri;
>         try {
>             uri = new URI(url);
>         } catch (URISyntaxException e) {
>             LOG.warn("Could not connect to remote URI: " + url + " due to bad URI syntax: " + e, e);
>             return;
>         }
>         synchronized (bridges) {
>             bridges.remove(uri);
>         }
>     }
> }
> {code}
> There are a number of problems:
> # The check-and-set for adding an entry {{bridges}} is not atomic.  As a result, concurrent attempts to add a bridge to the same URL can be allowed to proceed to bridge creation.  Only one of the bridges will be established (the other will fail when it attempts to add duplicate connections); however, the failure of the second bridge will result in a call to {{onServiceRemove(...)}} that will remove the single/shared {{bridges}} entry.
> # The bridge is started before an entry is added to {{bridges}}.  Since bridges are multi-threaded, it is possible that an exception will be handled by a different thread at some time between {{bridge.start()}} and {{bridges.put(uri, bridge}}.  In processing this exception, the thread will call {{onServiceRemove(...}}, which will remove the (non-existent) {{bridges}} entry.  Subseqently, {{bridges.put(uri, bridge)}} is executed, and adds the entry to {{bridges}} even though it is now stale and represents a failed bridge.  Subsequent attempts to restore the bridge will be ignored, and no active bridge will be created.
> The lack of thread-safety in {{DiscoveryNetworkConnector}} is exacerbated by AMQ-4159, which can result in many concurrent attempts to establish a bridge to the same URL.  AMQ-4159 also described how multiple calls can be made to {{onServiceRemove(...)}}, which can result in mal-behaviour similar to the second case described above (i.e., unexpected removal of a bridge that is active).
> Solution
> ========
> The attached patch attempts to restore some thread-safety to {{DiscoveryNetworkConnector}} by making the check-and-set atomic and adding the entry to {{bridges}} prior to starting the bridge.  An additional structure, {{activeEvents}}, keeps track of the event that represents the current attempt to establish a bridge to a given remote URL --- this is used to prevent multiple calls to {{onServiceRemove(...)}} from removing a {{bridges}} entry that was *not* added by the corresponding discovery event.  
> This patch is a band aid to the design flaws in {{DiscoveryNetworkConnector}}, and a refactoring of the connector should be considered.  In particular, there is a tight-coupling between {{DiscoveryNetworkConnector}} and {{SimpleDiscoveryAgent}} that is not evident through their interfaces.  For example, {{DiscoveryNetworkConnector}} assumes that the call to {{discoveryAgent.serviceFailed(...)}} will result in a call back to {{DiscoveryNetworkConnector.onServiceRemove(...)}}.  The call to {{onServiceRemove(...)}} is necessary to clean up the resources used by the bridge, but that requirement is not explicit, and therefore easily missed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (AMQ-4160) DiscoveryNetworkConnector can lose track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges

Posted by "Stirling Chow (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AMQ-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stirling Chow updated AMQ-4160:
-------------------------------

    Summary: DiscoveryNetworkConnector can lose track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges  (was: DiscoveryNetworkConnector can loss track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges)
    
> DiscoveryNetworkConnector can lose track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AMQ-4160
>                 URL: https://issues.apache.org/jira/browse/AMQ-4160
>             Project: ActiveMQ
>          Issue Type: Bug
>            Reporter: Stirling Chow
>            Assignee: Timothy Bish
>            Priority: Critical
>             Fix For: 5.8.0
>
>         Attachments: AMQ4160.patch, AMQ4160Test.java
>
>
> Symptom
> =======
> {{DiscoveryNetworkConnector}} is not thread-safe, and as a result, race conditions allow the {{bridges}} data structure to become corrupt and not accurately represent the bridges that exist.
> Because {{bridges}} is used to control whether a discovery event will result in a bridge creation attempt, if it is not accurate, two results are possible:
> # A discovery event will be ignored even though its corresponding bridge is not active; as a result, the bridge will never be established
> # A discovery event will be processed even though its corresponding bridge is active; as a result, the bridge creation will fail (because of duplicate connections), and be indefinitely retried, generating many WARN/ERROR log messages
> Cause
> =====
> {{DiscoveryNetworkConnector}} updates a {{bridges}} hashmap whenever a bridge is created or removed:
> {code:title=DiscoveryNetworkConnector.java}
> public void onServiceAdd(DiscoveryEvent event) {
> ...
>         // Should we try to connect to that URI?
>         synchronized (bridges) {
>             if( bridges.containsKey(uri) ) {
>                 if (LOG.isDebugEnabled()) {
>                     LOG.debug("Discovery agent generated a duplicate onServiceAdd event for: "+uri );
>                 }
>                 return;
>             }
>         }
> ...
>         NetworkBridge bridge = createBridge(localTransport, remoteTransport, event);
>         try {
>             bridge.start();
>             synchronized (bridges) {
>                 bridges.put(uri, bridge);
>             }
> ...
> }
> public void onServiceRemove(DiscoveryEvent event) {
>     String url = event.getServiceName();
>     if (url != null) {
>         URI uri;
>         try {
>             uri = new URI(url);
>         } catch (URISyntaxException e) {
>             LOG.warn("Could not connect to remote URI: " + url + " due to bad URI syntax: " + e, e);
>             return;
>         }
>         synchronized (bridges) {
>             bridges.remove(uri);
>         }
>     }
> }
> {code}
> There are a number of problems:
> # The check-and-set for adding an entry {{bridges}} is not atomic.  As a result, concurrent attempts to add a bridge to the same URL can be allowed to proceed to bridge creation.  Only one of the bridges will be established (the other will fail when it attempts to add duplicate connections); however, the failure of the second bridge will result in a call to {{onServiceRemove(...)}} that will remove the single/shared {{bridges}} entry.
> # The bridge is started before an entry is added to {{bridges}}.  Since bridges are multi-threaded, it is possible that an exception will be handled by a different thread at some time between {{bridge.start()}} and {{bridges.put(uri, bridge}}.  In processing this exception, the thread will call {{onServiceRemove(...}}, which will remove the (non-existent) {{bridges}} entry.  Subseqently, {{bridges.put(uri, bridge)}} is executed, and adds the entry to {{bridges}} even though it is now stale and represents a failed bridge.  Subsequent attempts to restore the bridge will be ignored, and no active bridge will be created.
> The lack of thread-safety in {{DiscoveryNetworkConnector}} is exacerbated by AMQ-4159, which can result in many concurrent attempts to establish a bridge to the same URL.  AMQ-4159 also described how multiple calls can be made to {{onServiceRemove(...)}}, which can result in mal-behaviour similar to the second case described above (i.e., unexpected removal of a bridge that is active).
> Solution
> ========
> The attached patch attempts to restore some thread-safety to {{DiscoveryNetworkConnector}} by making the check-and-set atomic and adding the entry to {{bridges}} prior to starting the bridge.  An additional structure, {{activeEvents}}, keeps track of the event that represents the current attempt to establish a bridge to a given remote URL --- this is used to prevent multiple calls to {{onServiceRemove(...)}} from removing a {{bridges}} entry that was *not* added by the corresponding discovery event.  
> This patch is a band aid to the design flaws in {{DiscoveryNetworkConnector}}, and a refactoring of the connector should be considered.  In particular, there is a tight-coupling between {{DiscoveryNetworkConnector}} and {{SimpleDiscoveryAgent}} that is not evident through their interfaces.  For example, {{DiscoveryNetworkConnector}} assumes that the call to {{discoveryAgent.serviceFailed(...)}} will result in a call back to {{DiscoveryNetworkConnector.onServiceRemove(...)}}.  The call to {{onServiceRemove(...)}} is necessary to clean up the resources used by the bridge, but that requirement is not explicit, and therefore easily missed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (AMQ-4160) DiscoveryNetworkConnector can lose track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges

Posted by "Timothy Bish (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AMQ-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timothy Bish resolved AMQ-4160.
-------------------------------

    Resolution: Fixed

Seems to be working now.
                
> DiscoveryNetworkConnector can lose track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AMQ-4160
>                 URL: https://issues.apache.org/jira/browse/AMQ-4160
>             Project: ActiveMQ
>          Issue Type: Bug
>            Reporter: Stirling Chow
>            Assignee: Timothy Bish
>            Priority: Critical
>             Fix For: 5.8.0
>
>         Attachments: AMQ4160_2.patch, AMQ4160.patch, AMQ4160Test.java
>
>
> Symptom
> =======
> {{DiscoveryNetworkConnector}} is not thread-safe, and as a result, race conditions allow the {{bridges}} data structure to become corrupt and not accurately represent the bridges that exist.
> Because {{bridges}} is used to control whether a discovery event will result in a bridge creation attempt, if it is not accurate, two results are possible:
> # A discovery event will be ignored even though its corresponding bridge is not active; as a result, the bridge will never be established
> # A discovery event will be processed even though its corresponding bridge is active; as a result, the bridge creation will fail (because of duplicate connections), and be indefinitely retried, generating many WARN/ERROR log messages
> Cause
> =====
> {{DiscoveryNetworkConnector}} updates a {{bridges}} hashmap whenever a bridge is created or removed:
> {code:title=DiscoveryNetworkConnector.java}
> public void onServiceAdd(DiscoveryEvent event) {
> ...
>         // Should we try to connect to that URI?
>         synchronized (bridges) {
>             if( bridges.containsKey(uri) ) {
>                 if (LOG.isDebugEnabled()) {
>                     LOG.debug("Discovery agent generated a duplicate onServiceAdd event for: "+uri );
>                 }
>                 return;
>             }
>         }
> ...
>         NetworkBridge bridge = createBridge(localTransport, remoteTransport, event);
>         try {
>             bridge.start();
>             synchronized (bridges) {
>                 bridges.put(uri, bridge);
>             }
> ...
> }
> public void onServiceRemove(DiscoveryEvent event) {
>     String url = event.getServiceName();
>     if (url != null) {
>         URI uri;
>         try {
>             uri = new URI(url);
>         } catch (URISyntaxException e) {
>             LOG.warn("Could not connect to remote URI: " + url + " due to bad URI syntax: " + e, e);
>             return;
>         }
>         synchronized (bridges) {
>             bridges.remove(uri);
>         }
>     }
> }
> {code}
> There are a number of problems:
> # The check-and-set for adding an entry {{bridges}} is not atomic.  As a result, concurrent attempts to add a bridge to the same URL can be allowed to proceed to bridge creation.  Only one of the bridges will be established (the other will fail when it attempts to add duplicate connections); however, the failure of the second bridge will result in a call to {{onServiceRemove(...)}} that will remove the single/shared {{bridges}} entry.
> # The bridge is started before an entry is added to {{bridges}}.  Since bridges are multi-threaded, it is possible that an exception will be handled by a different thread at some time between {{bridge.start()}} and {{bridges.put(uri, bridge}}.  In processing this exception, the thread will call {{onServiceRemove(...}}, which will remove the (non-existent) {{bridges}} entry.  Subseqently, {{bridges.put(uri, bridge)}} is executed, and adds the entry to {{bridges}} even though it is now stale and represents a failed bridge.  Subsequent attempts to restore the bridge will be ignored, and no active bridge will be created.
> The lack of thread-safety in {{DiscoveryNetworkConnector}} is exacerbated by AMQ-4159, which can result in many concurrent attempts to establish a bridge to the same URL.  AMQ-4159 also described how multiple calls can be made to {{onServiceRemove(...)}}, which can result in mal-behaviour similar to the second case described above (i.e., unexpected removal of a bridge that is active).
> Solution
> ========
> The attached patch attempts to restore some thread-safety to {{DiscoveryNetworkConnector}} by making the check-and-set atomic and adding the entry to {{bridges}} prior to starting the bridge.  An additional structure, {{activeEvents}}, keeps track of the event that represents the current attempt to establish a bridge to a given remote URL --- this is used to prevent multiple calls to {{onServiceRemove(...)}} from removing a {{bridges}} entry that was *not* added by the corresponding discovery event.  
> This patch is a band aid to the design flaws in {{DiscoveryNetworkConnector}}, and a refactoring of the connector should be considered.  In particular, there is a tight-coupling between {{DiscoveryNetworkConnector}} and {{SimpleDiscoveryAgent}} that is not evident through their interfaces.  For example, {{DiscoveryNetworkConnector}} assumes that the call to {{discoveryAgent.serviceFailed(...)}} will result in a call back to {{DiscoveryNetworkConnector.onServiceRemove(...)}}.  The call to {{onServiceRemove(...)}} is necessary to clean up the resources used by the bridge, but that requirement is not explicit, and therefore easily missed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Reopened] (AMQ-4160) DiscoveryNetworkConnector can lose track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges

Posted by "Stirling Chow (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AMQ-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stirling Chow reopened AMQ-4160:
--------------------------------

    Regression:   (was: Unit Test Broken)

Reopening this as the activeEvents map was not being cleared when the network connector was stopped; as a result, restarting the network connector would not result in the bridges being re-established.

The updated patch contains the following changes:
- Unit test updated to include verification of handleStop
- Unit test updated to allow inclusion of the previously removed test; now verifies pre and post patch
- Improved waitForBridge/hasBridge methods added to JmsMultipleBrokersTestSupport --- the existing waitForBridgeFormation methods are based on NetworkConnector.bridges, which is updated prior to connections (and thus the bridge) actually being formed --- the new waitForBridge method waits for broker info to be exchange before declaring that a bridge has foremd
                
> DiscoveryNetworkConnector can lose track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AMQ-4160
>                 URL: https://issues.apache.org/jira/browse/AMQ-4160
>             Project: ActiveMQ
>          Issue Type: Bug
>            Reporter: Stirling Chow
>            Assignee: Timothy Bish
>            Priority: Critical
>             Fix For: 5.8.0
>
>         Attachments: AMQ4160_2.patch, AMQ4160.patch, AMQ4160Test.java
>
>
> Symptom
> =======
> {{DiscoveryNetworkConnector}} is not thread-safe, and as a result, race conditions allow the {{bridges}} data structure to become corrupt and not accurately represent the bridges that exist.
> Because {{bridges}} is used to control whether a discovery event will result in a bridge creation attempt, if it is not accurate, two results are possible:
> # A discovery event will be ignored even though its corresponding bridge is not active; as a result, the bridge will never be established
> # A discovery event will be processed even though its corresponding bridge is active; as a result, the bridge creation will fail (because of duplicate connections), and be indefinitely retried, generating many WARN/ERROR log messages
> Cause
> =====
> {{DiscoveryNetworkConnector}} updates a {{bridges}} hashmap whenever a bridge is created or removed:
> {code:title=DiscoveryNetworkConnector.java}
> public void onServiceAdd(DiscoveryEvent event) {
> ...
>         // Should we try to connect to that URI?
>         synchronized (bridges) {
>             if( bridges.containsKey(uri) ) {
>                 if (LOG.isDebugEnabled()) {
>                     LOG.debug("Discovery agent generated a duplicate onServiceAdd event for: "+uri );
>                 }
>                 return;
>             }
>         }
> ...
>         NetworkBridge bridge = createBridge(localTransport, remoteTransport, event);
>         try {
>             bridge.start();
>             synchronized (bridges) {
>                 bridges.put(uri, bridge);
>             }
> ...
> }
> public void onServiceRemove(DiscoveryEvent event) {
>     String url = event.getServiceName();
>     if (url != null) {
>         URI uri;
>         try {
>             uri = new URI(url);
>         } catch (URISyntaxException e) {
>             LOG.warn("Could not connect to remote URI: " + url + " due to bad URI syntax: " + e, e);
>             return;
>         }
>         synchronized (bridges) {
>             bridges.remove(uri);
>         }
>     }
> }
> {code}
> There are a number of problems:
> # The check-and-set for adding an entry {{bridges}} is not atomic.  As a result, concurrent attempts to add a bridge to the same URL can be allowed to proceed to bridge creation.  Only one of the bridges will be established (the other will fail when it attempts to add duplicate connections); however, the failure of the second bridge will result in a call to {{onServiceRemove(...)}} that will remove the single/shared {{bridges}} entry.
> # The bridge is started before an entry is added to {{bridges}}.  Since bridges are multi-threaded, it is possible that an exception will be handled by a different thread at some time between {{bridge.start()}} and {{bridges.put(uri, bridge}}.  In processing this exception, the thread will call {{onServiceRemove(...}}, which will remove the (non-existent) {{bridges}} entry.  Subseqently, {{bridges.put(uri, bridge)}} is executed, and adds the entry to {{bridges}} even though it is now stale and represents a failed bridge.  Subsequent attempts to restore the bridge will be ignored, and no active bridge will be created.
> The lack of thread-safety in {{DiscoveryNetworkConnector}} is exacerbated by AMQ-4159, which can result in many concurrent attempts to establish a bridge to the same URL.  AMQ-4159 also described how multiple calls can be made to {{onServiceRemove(...)}}, which can result in mal-behaviour similar to the second case described above (i.e., unexpected removal of a bridge that is active).
> Solution
> ========
> The attached patch attempts to restore some thread-safety to {{DiscoveryNetworkConnector}} by making the check-and-set atomic and adding the entry to {{bridges}} prior to starting the bridge.  An additional structure, {{activeEvents}}, keeps track of the event that represents the current attempt to establish a bridge to a given remote URL --- this is used to prevent multiple calls to {{onServiceRemove(...)}} from removing a {{bridges}} entry that was *not* added by the corresponding discovery event.  
> This patch is a band aid to the design flaws in {{DiscoveryNetworkConnector}}, and a refactoring of the connector should be considered.  In particular, there is a tight-coupling between {{DiscoveryNetworkConnector}} and {{SimpleDiscoveryAgent}} that is not evident through their interfaces.  For example, {{DiscoveryNetworkConnector}} assumes that the call to {{discoveryAgent.serviceFailed(...)}} will result in a call back to {{DiscoveryNetworkConnector.onServiceRemove(...)}}.  The call to {{onServiceRemove(...)}} is necessary to clean up the resources used by the bridge, but that requirement is not explicit, and therefore easily missed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (AMQ-4160) DiscoveryNetworkConnector can lose track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges

Posted by "Timothy Bish (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AMQ-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493537#comment-13493537 ] 

Timothy Bish commented on AMQ-4160:
-----------------------------------

Once you provide the updated patch we'll get this into a SNAPSHOT.
                
> DiscoveryNetworkConnector can lose track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AMQ-4160
>                 URL: https://issues.apache.org/jira/browse/AMQ-4160
>             Project: ActiveMQ
>          Issue Type: Bug
>            Reporter: Stirling Chow
>            Assignee: Timothy Bish
>            Priority: Critical
>             Fix For: 5.8.0
>
>         Attachments: AMQ4160.patch, AMQ4160Test.java
>
>
> Symptom
> =======
> {{DiscoveryNetworkConnector}} is not thread-safe, and as a result, race conditions allow the {{bridges}} data structure to become corrupt and not accurately represent the bridges that exist.
> Because {{bridges}} is used to control whether a discovery event will result in a bridge creation attempt, if it is not accurate, two results are possible:
> # A discovery event will be ignored even though its corresponding bridge is not active; as a result, the bridge will never be established
> # A discovery event will be processed even though its corresponding bridge is active; as a result, the bridge creation will fail (because of duplicate connections), and be indefinitely retried, generating many WARN/ERROR log messages
> Cause
> =====
> {{DiscoveryNetworkConnector}} updates a {{bridges}} hashmap whenever a bridge is created or removed:
> {code:title=DiscoveryNetworkConnector.java}
> public void onServiceAdd(DiscoveryEvent event) {
> ...
>         // Should we try to connect to that URI?
>         synchronized (bridges) {
>             if( bridges.containsKey(uri) ) {
>                 if (LOG.isDebugEnabled()) {
>                     LOG.debug("Discovery agent generated a duplicate onServiceAdd event for: "+uri );
>                 }
>                 return;
>             }
>         }
> ...
>         NetworkBridge bridge = createBridge(localTransport, remoteTransport, event);
>         try {
>             bridge.start();
>             synchronized (bridges) {
>                 bridges.put(uri, bridge);
>             }
> ...
> }
> public void onServiceRemove(DiscoveryEvent event) {
>     String url = event.getServiceName();
>     if (url != null) {
>         URI uri;
>         try {
>             uri = new URI(url);
>         } catch (URISyntaxException e) {
>             LOG.warn("Could not connect to remote URI: " + url + " due to bad URI syntax: " + e, e);
>             return;
>         }
>         synchronized (bridges) {
>             bridges.remove(uri);
>         }
>     }
> }
> {code}
> There are a number of problems:
> # The check-and-set for adding an entry {{bridges}} is not atomic.  As a result, concurrent attempts to add a bridge to the same URL can be allowed to proceed to bridge creation.  Only one of the bridges will be established (the other will fail when it attempts to add duplicate connections); however, the failure of the second bridge will result in a call to {{onServiceRemove(...)}} that will remove the single/shared {{bridges}} entry.
> # The bridge is started before an entry is added to {{bridges}}.  Since bridges are multi-threaded, it is possible that an exception will be handled by a different thread at some time between {{bridge.start()}} and {{bridges.put(uri, bridge}}.  In processing this exception, the thread will call {{onServiceRemove(...}}, which will remove the (non-existent) {{bridges}} entry.  Subseqently, {{bridges.put(uri, bridge)}} is executed, and adds the entry to {{bridges}} even though it is now stale and represents a failed bridge.  Subsequent attempts to restore the bridge will be ignored, and no active bridge will be created.
> The lack of thread-safety in {{DiscoveryNetworkConnector}} is exacerbated by AMQ-4159, which can result in many concurrent attempts to establish a bridge to the same URL.  AMQ-4159 also described how multiple calls can be made to {{onServiceRemove(...)}}, which can result in mal-behaviour similar to the second case described above (i.e., unexpected removal of a bridge that is active).
> Solution
> ========
> The attached patch attempts to restore some thread-safety to {{DiscoveryNetworkConnector}} by making the check-and-set atomic and adding the entry to {{bridges}} prior to starting the bridge.  An additional structure, {{activeEvents}}, keeps track of the event that represents the current attempt to establish a bridge to a given remote URL --- this is used to prevent multiple calls to {{onServiceRemove(...)}} from removing a {{bridges}} entry that was *not* added by the corresponding discovery event.  
> This patch is a band aid to the design flaws in {{DiscoveryNetworkConnector}}, and a refactoring of the connector should be considered.  In particular, there is a tight-coupling between {{DiscoveryNetworkConnector}} and {{SimpleDiscoveryAgent}} that is not evident through their interfaces.  For example, {{DiscoveryNetworkConnector}} assumes that the call to {{discoveryAgent.serviceFailed(...)}} will result in a call back to {{DiscoveryNetworkConnector.onServiceRemove(...)}}.  The call to {{onServiceRemove(...)}} is necessary to clean up the resources used by the bridge, but that requirement is not explicit, and therefore easily missed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (AMQ-4160) DiscoveryNetworkConnector can loss track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges

Posted by "Stirling Chow (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AMQ-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stirling Chow updated AMQ-4160:
-------------------------------

    Attachment: AMQ4160Test.java

Unit test that demonstrates the race conditions described in this ticket.  Both tests fail before the patch and pass after the patch.
                
> DiscoveryNetworkConnector can loss track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AMQ-4160
>                 URL: https://issues.apache.org/jira/browse/AMQ-4160
>             Project: ActiveMQ
>          Issue Type: Bug
>            Reporter: Stirling Chow
>            Priority: Critical
>             Fix For: 5.8.0
>
>         Attachments: AMQ4160Test.java
>
>
> Symptom
> =======
> {{DiscoveryNetworkConnector}} is not thread-safe, and as a result, race conditions allow the {{bridges}} data structure to become corrupt and not accurately represent the bridges that exist.
> Because {{bridges}} is used to control whether a discovery event will result in a bridge creation attempt, if it is not accurate, two results are possible:
> # A discovery event will be ignored even though its corresponding bridge is not active; as a result, the bridge will never be established
> # A discovery event will be processed even though its corresponding bridge is active; as a result, the bridge creation will fail (because of duplicate connections), and be indefinitely retried, generating many WARN/ERROR log messages
> Cause
> =====
> {{DiscoveryNetworkConnector}} updates a {{bridges}} hashmap whenever a bridge is created or removed:
> {code:title=DiscoveryNetworkConnector.java}
> public void onServiceAdd(DiscoveryEvent event) {
> ...
>         // Should we try to connect to that URI?
>         synchronized (bridges) {
>             if( bridges.containsKey(uri) ) {
>                 if (LOG.isDebugEnabled()) {
>                     LOG.debug("Discovery agent generated a duplicate onServiceAdd event for: "+uri );
>                 }
>                 return;
>             }
>         }
> ...
>         NetworkBridge bridge = createBridge(localTransport, remoteTransport, event);
>         try {
>             bridge.start();
>             synchronized (bridges) {
>                 bridges.put(uri, bridge);
>             }
> ...
> }
> public void onServiceRemove(DiscoveryEvent event) {
>     String url = event.getServiceName();
>     if (url != null) {
>         URI uri;
>         try {
>             uri = new URI(url);
>         } catch (URISyntaxException e) {
>             LOG.warn("Could not connect to remote URI: " + url + " due to bad URI syntax: " + e, e);
>             return;
>         }
>         synchronized (bridges) {
>             bridges.remove(uri);
>         }
>     }
> }
> {code}
> There are a number of problems:
> # The check-and-set for adding an entry {{bridges}} is not atomic.  As a result, concurrent attempts to add a bridge to the same URL can be allowed to proceed to bridge creation.  Only one of the bridges will be established (the other will fail when it attempts to add duplicate connections); however, the failure of the second bridge will result in a call to {{onServiceRemove(...)}} that will remove the single/shared {{bridges}} entry.
> # The bridge is started before an entry is added to {{bridges}}.  Since bridges are multi-threaded, it is possible that an exception will be handled by a different thread at some time between {{bridge.start()}} and {{bridges.put(uri, bridge}}.  In processing this exception, the thread will call {{onServiceRemove(...}}, which will remove the (non-existent) {{bridges}} entry.  Subseqently, {{bridges.put(uri, bridge)}} is executed, and adds the entry to {{bridges}} even though it is now stale and represents a failed bridge.  Subsequent attempts to restore the bridge will be ignored, and no active bridge will be created.
> The lack of thread-safety in {{DiscoveryNetworkConnector}} is exacerbated by AMQ-4159, which can result in many concurrent attempts to establish a bridge to the same URL.  AMQ-4159 also described how multiple calls can be made to {{onServiceRemove(...)}}, which can result in mal-behaviour similar to the second case described above (i.e., unexpected removal of a bridge that is active).
> Solution
> ========
> The attached patch attempts to restore some thread-safety to {{DiscoveryNetworkConnector}} by making the check-and-set atomic and adding the entry to {{bridges}} prior to starting the bridge.  An additional structure, {{activeEvents}}, keeps track of the event that represents the current attempt to establish a bridge to a given remote URL --- this is used to prevent multiple calls to {{onServiceRemove(...)}} from removing a {{bridges}} entry that was *not* added by the corresponding discovery event.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira