You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@activemq.apache.org by "Mario Lukica (JIRA)" <ji...@apache.org> on 2008/07/11 15:32:01 UTC

[jira] Created: (AMQ-1855) bridge reconnection stops because of race in SimpleDiscoveryAgent

bridge reconnection stops because of race in SimpleDiscoveryAgent
-----------------------------------------------------------------

                 Key: AMQ-1855
                 URL: https://issues.apache.org/activemq/browse/AMQ-1855
             Project: ActiveMQ
          Issue Type: Bug
          Components: Connector
    Affects Versions: 4.1.2
            Reporter: Mario Lukica


I believe there is a race condition in SimpleDiscoveryAgent which can cause subsequent bridge restart to fail, without starting new thread that should restart a bridge. As a consequence, network bridge is never restarted.

Following scenario leads to this:
1. bridge is disconnected (e.g. local error: org.apache.activemq.transport.InactivityIOException: Channel was inactive for too long)
2. bridge is disposed in separate thread in DemandForwardingBridge.serviceLocalException
3. SimpleDiscoveryAgent.serviceFailed is called which starts up another thread which calls DiscoveryNetworkConnector.onServiceAdd which tries to restart bridge
4. bridge startup can cause javax.jms.InvalidClientIDException: Broker: some_broker2 - Client: NC_some_broker1_inboundlocalhost already connected (this one is caused by race condition with thread disposing the bridge, since given client subscription should be removed by thread disposing the bridge (step 2)
5. this causes invocation of DemandForwardingBridge.serviceLocalException (this call can be made asynchronously, while previous bridge startup is still in progress)

As a consequence, multiple threads can end up calling SimpleDiscoveryAgent.serviceFailed simultaneously.

serviceFailed will call DiscoveryNetworkConnector.onServiceAdd which will try to reconnect bridge. Reconnect logic is guarded by 
if( event.failed.compareAndSet(false, true) ) 

which tries to ensure that only a single thread is reconnecting bridge at some point.
{code}
    public void serviceFailed(DiscoveryEvent devent) throws IOException {
    	
        final SimpleDiscoveryEvent event = (SimpleDiscoveryEvent) devent;
        if( event.failed.compareAndSet(false, true) ) {
        	
			listener.onServiceRemove(event);
	    	Thread thread = new Thread() {
	    		public void run() {
	
	
	    			// We detect a failed connection attempt because the service fails right
	    			// away.
	    			if( event.connectTime + minConnectTime > System.currentTimeMillis()  ) {
	    				
	    				event.connectFailures++;
	    				
	    				if( maxReconnectAttempts>0 &&  event.connectFailures >= maxReconnectAttempts ) {
	    					// Don' try to re-connect
	    					return;
	    				}
	    				
		                synchronized(sleepMutex){
		                    try{
		                    	if( !running.get() )
		                    		return;
		                    	
		                        sleepMutex.wait(event.reconnectDelay);
		                    }catch(InterruptedException ie){
                                Thread.currentThread().interrupt();
		                       return;
		                    }
		                }
	
		                if (!useExponentialBackOff) {
		                    event.reconnectDelay = initialReconnectDelay;
		                } else {
		                    // Exponential increment of reconnect delay.
		                    event.reconnectDelay*=backOffMultiplier;
		                    if(event.reconnectDelay>maxReconnectDelay)
		                        event.reconnectDelay=maxReconnectDelay;
		                }
		                
	    			} else {
	    				event.connectFailures = 0;
	                    event.reconnectDelay = initialReconnectDelay;
	    			}
	    			                    			
	            	if( !running.get() )
	            		return;
	            	
	    			event.connectTime = System.currentTimeMillis();
	    			event.failed.set(false);
	    			
	    			listener.onServiceAdd(event);
	    		}
	    	};
	    	thread.setDaemon(true);
	    	thread.start();
        }
    }
{code}

Prior to calling DiscoveryNetworkConnector.onServiceAdd, event.failed is set to false (T1), and it's possible for some other thread (T2) to enter block guarded by if( event.failed.compareAndSet(false, true) ) , while reconnect process has already begun by first thread. T2 can satisfy condition: if( event.connectTime + minConnectTime > System.currentTimeMillis()  )  and will enter 
sleepMutex.wait(event.reconnectDelay), but still holding event.failed == true (causing all other calls to serviceFailed not to start thread that will reconnect bridge).

If first thread (T1) fails to reconnect bridge (e.g because of InvalidClientIDException described in step 4), it will not schedule new thread to restart broker (and call DiscoveryNetworkConnector.onServiceRemove, and cleanup DiscoveryNetworkConnector.bridges) because of event.failed == true, and T2 still waiting (default 5 sec). When T2 wakes up from wait, it will try to restart broker and fail because of following condition in DiscoveryNetworkConnector:
{code}
            if (    bridges.containsKey(uri) 
                    || localURI.equals(uri) 
                    || (connectionFilter!=null && !connectionFilter.connectTo(uri))
                    )
                return;
{code}

bridges.containsKey(uri) will be true (thread T1 added it while unsuccessfully trying to reconnect bridge), and T2 will return from DiscoveryNetworkConnector.onServiceAdd and will not start bridge. 
No additional attempt to reconnect bridge will be made, since T2 held event.failed == true, effectively ignoring SimpleDiscoveryAgent.serviceFailed calls from other threads processing local or remote bridge exceptions.

End result:
- DiscoveryNetworkConnector.bridges contains bridge that is disposed and prevents all other attempts to restart bridge (onServiceAdd always returns because bridges.containsKey(uri) == true) 
- SimpleDiscoveryAgent doesn't try to reconnect the bridge (T2 was a last attempt which returned without restarting the bridge - SimpleDiscoveryAgent.serviceFailed is not called again, since bridge is not started

I think that synchronization of threads processing bridge exceptions and entering SimpleDiscoveryAgent.serviceFailed should be verified and/or improved.
Also, InvalidClientIDException is relatively common (at least on multicore machines, e.g. Solaris T2000), maybe ConduitBridge.serviceLocalException (which starts another thread doing ServiceSupport.dispose(DemandForwardingBridgeSupport.this)), should be changed to wait a bit for bridge disposal to finish (e.g. sleep for some time) and then try to restart a bridge - waiting for a second more to restart a bridge is better then not to start it at all

I've seen this problem in 4.1.0 and 4.1.2, but I think it can occur in 5.1 and 5.2 trunk (SimpleDiscoveryAgent.serviceFailed and DiscoveryNetworkConnector.onServiceAdd are more or less the same, just using ASYNC_TASKS to execute asynchronous calls, instead of starting new threads directly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (AMQ-1855) bridge reconnection stops because of race in SimpleDiscoveryAgent

Posted by "Rob Davies (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/activemq/browse/AMQ-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rob Davies reassigned AMQ-1855:
-------------------------------

    Assignee: Rob Davies

> bridge reconnection stops because of race in SimpleDiscoveryAgent
> -----------------------------------------------------------------
>
>                 Key: AMQ-1855
>                 URL: https://issues.apache.org/activemq/browse/AMQ-1855
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: Connector
>    Affects Versions: 4.1.2
>            Reporter: Mario Lukica
>            Assignee: Rob Davies
>
> I believe there is a race condition in SimpleDiscoveryAgent which can cause subsequent bridge restart to fail, without starting new thread that should restart a bridge. As a consequence, network bridge is never restarted.
> Following scenario leads to this:
> 1. bridge is disconnected (e.g. local error: org.apache.activemq.transport.InactivityIOException: Channel was inactive for too long)
> 2. bridge is disposed in separate thread in DemandForwardingBridge.serviceLocalException
> 3. SimpleDiscoveryAgent.serviceFailed is called which starts up another thread which calls DiscoveryNetworkConnector.onServiceAdd which tries to restart bridge
> 4. bridge startup can cause javax.jms.InvalidClientIDException: Broker: some_broker2 - Client: NC_some_broker1_inboundlocalhost already connected (this one is caused by race condition with thread disposing the bridge, since given client subscription should be removed by thread disposing the bridge (step 2)
> 5. this causes invocation of DemandForwardingBridge.serviceLocalException (this call can be made asynchronously, while previous bridge startup is still in progress)
> As a consequence, multiple threads can end up calling SimpleDiscoveryAgent.serviceFailed simultaneously.
> serviceFailed will call DiscoveryNetworkConnector.onServiceAdd which will try to reconnect bridge. Reconnect logic is guarded by 
> if( event.failed.compareAndSet(false, true) ) 
> which tries to ensure that only a single thread is reconnecting bridge at some point.
> {code}
>     public void serviceFailed(DiscoveryEvent devent) throws IOException {
>     	
>         final SimpleDiscoveryEvent event = (SimpleDiscoveryEvent) devent;
>         if( event.failed.compareAndSet(false, true) ) {
>         	
> 			listener.onServiceRemove(event);
> 	    	Thread thread = new Thread() {
> 	    		public void run() {
> 	
> 	
> 	    			// We detect a failed connection attempt because the service fails right
> 	    			// away.
> 	    			if( event.connectTime + minConnectTime > System.currentTimeMillis()  ) {
> 	    				
> 	    				event.connectFailures++;
> 	    				
> 	    				if( maxReconnectAttempts>0 &&  event.connectFailures >= maxReconnectAttempts ) {
> 	    					// Don' try to re-connect
> 	    					return;
> 	    				}
> 	    				
> 		                synchronized(sleepMutex){
> 		                    try{
> 		                    	if( !running.get() )
> 		                    		return;
> 		                    	
> 		                        sleepMutex.wait(event.reconnectDelay);
> 		                    }catch(InterruptedException ie){
>                                 Thread.currentThread().interrupt();
> 		                       return;
> 		                    }
> 		                }
> 	
> 		                if (!useExponentialBackOff) {
> 		                    event.reconnectDelay = initialReconnectDelay;
> 		                } else {
> 		                    // Exponential increment of reconnect delay.
> 		                    event.reconnectDelay*=backOffMultiplier;
> 		                    if(event.reconnectDelay>maxReconnectDelay)
> 		                        event.reconnectDelay=maxReconnectDelay;
> 		                }
> 		                
> 	    			} else {
> 	    				event.connectFailures = 0;
> 	                    event.reconnectDelay = initialReconnectDelay;
> 	    			}
> 	    			                    			
> 	            	if( !running.get() )
> 	            		return;
> 	            	
> 	    			event.connectTime = System.currentTimeMillis();
> 	    			event.failed.set(false);
> 	    			
> 	    			listener.onServiceAdd(event);
> 	    		}
> 	    	};
> 	    	thread.setDaemon(true);
> 	    	thread.start();
>         }
>     }
> {code}
> Prior to calling DiscoveryNetworkConnector.onServiceAdd, event.failed is set to false (T1), and it's possible for some other thread (T2) to enter block guarded by if( event.failed.compareAndSet(false, true) ) , while reconnect process has already begun by first thread. T2 can satisfy condition: if( event.connectTime + minConnectTime > System.currentTimeMillis()  )  and will enter 
> sleepMutex.wait(event.reconnectDelay), but still holding event.failed == true (causing all other calls to serviceFailed not to start thread that will reconnect bridge).
> If first thread (T1) fails to reconnect bridge (e.g because of InvalidClientIDException described in step 4), it will not schedule new thread to restart broker (and call DiscoveryNetworkConnector.onServiceRemove, and cleanup DiscoveryNetworkConnector.bridges) because of event.failed == true, and T2 still waiting (default 5 sec). When T2 wakes up from wait, it will try to restart broker and fail because of following condition in DiscoveryNetworkConnector:
> {code}
>             if (    bridges.containsKey(uri) 
>                     || localURI.equals(uri) 
>                     || (connectionFilter!=null && !connectionFilter.connectTo(uri))
>                     )
>                 return;
> {code}
> bridges.containsKey(uri) will be true (thread T1 added it while unsuccessfully trying to reconnect bridge), and T2 will return from DiscoveryNetworkConnector.onServiceAdd and will not start bridge. 
> No additional attempt to reconnect bridge will be made, since T2 held event.failed == true, effectively ignoring SimpleDiscoveryAgent.serviceFailed calls from other threads processing local or remote bridge exceptions.
> End result:
> - DiscoveryNetworkConnector.bridges contains bridge that is disposed and prevents all other attempts to restart bridge (onServiceAdd always returns because bridges.containsKey(uri) == true) 
> - SimpleDiscoveryAgent doesn't try to reconnect the bridge (T2 was a last attempt which returned without restarting the bridge - SimpleDiscoveryAgent.serviceFailed is not called again, since bridge is not started
> I think that synchronization of threads processing bridge exceptions and entering SimpleDiscoveryAgent.serviceFailed should be verified and/or improved.
> Also, InvalidClientIDException is relatively common (at least on multicore machines, e.g. Solaris T2000), maybe ConduitBridge.serviceLocalException (which starts another thread doing ServiceSupport.dispose(DemandForwardingBridgeSupport.this)), should be changed to wait a bit for bridge disposal to finish (e.g. sleep for some time) and then try to restart a bridge - waiting for a second more to restart a bridge is better then not to start it at all
> I've seen this problem in 4.1.0 and 4.1.2, but I think it can occur in 5.1 and 5.2 trunk (SimpleDiscoveryAgent.serviceFailed and DiscoveryNetworkConnector.onServiceAdd are more or less the same, just using ASYNC_TASKS to execute asynchronous calls, instead of starting new threads directly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (AMQ-1855) bridge reconnection stops because of race in SimpleDiscoveryAgent

Posted by "Gary Tully (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/activemq/browse/AMQ-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Tully resolved AMQ-1855.
-----------------------------

    Resolution: Fixed

though I could not reproduce, the changes in revision http://svn.apache.org/viewcvs?view=rev&rev=808890 will fix this issue for both the simple discovery and multicast discovery providers.

> bridge reconnection stops because of race in SimpleDiscoveryAgent
> -----------------------------------------------------------------
>
>                 Key: AMQ-1855
>                 URL: https://issues.apache.org/activemq/browse/AMQ-1855
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: Connector
>    Affects Versions: 4.1.2
>            Reporter: Mario Lukica
>            Assignee: Gary Tully
>             Fix For: 5.3.0
>
>
> I believe there is a race condition in SimpleDiscoveryAgent which can cause subsequent bridge restart to fail, without starting new thread that should restart a bridge. As a consequence, network bridge is never restarted.
> Following scenario leads to this:
> 1. bridge is disconnected (e.g. local error: org.apache.activemq.transport.InactivityIOException: Channel was inactive for too long)
> 2. bridge is disposed in separate thread in DemandForwardingBridge.serviceLocalException
> 3. SimpleDiscoveryAgent.serviceFailed is called which starts up another thread which calls DiscoveryNetworkConnector.onServiceAdd which tries to restart bridge
> 4. bridge startup can cause javax.jms.InvalidClientIDException: Broker: some_broker2 - Client: NC_some_broker1_inboundlocalhost already connected (this one is caused by race condition with thread disposing the bridge, since given client subscription should be removed by thread disposing the bridge (step 2)
> 5. this causes invocation of DemandForwardingBridge.serviceLocalException (this call can be made asynchronously, while previous bridge startup is still in progress)
> As a consequence, multiple threads can end up calling SimpleDiscoveryAgent.serviceFailed simultaneously.
> serviceFailed will call DiscoveryNetworkConnector.onServiceAdd which will try to reconnect bridge. Reconnect logic is guarded by 
> if( event.failed.compareAndSet(false, true) ) 
> which tries to ensure that only a single thread is reconnecting bridge at some point.
> {code}
>     public void serviceFailed(DiscoveryEvent devent) throws IOException {
>     	
>         final SimpleDiscoveryEvent event = (SimpleDiscoveryEvent) devent;
>         if( event.failed.compareAndSet(false, true) ) {
>         	
> 			listener.onServiceRemove(event);
> 	    	Thread thread = new Thread() {
> 	    		public void run() {
> 	
> 	
> 	    			// We detect a failed connection attempt because the service fails right
> 	    			// away.
> 	    			if( event.connectTime + minConnectTime > System.currentTimeMillis()  ) {
> 	    				
> 	    				event.connectFailures++;
> 	    				
> 	    				if( maxReconnectAttempts>0 &&  event.connectFailures >= maxReconnectAttempts ) {
> 	    					// Don' try to re-connect
> 	    					return;
> 	    				}
> 	    				
> 		                synchronized(sleepMutex){
> 		                    try{
> 		                    	if( !running.get() )
> 		                    		return;
> 		                    	
> 		                        sleepMutex.wait(event.reconnectDelay);
> 		                    }catch(InterruptedException ie){
>                                 Thread.currentThread().interrupt();
> 		                       return;
> 		                    }
> 		                }
> 	
> 		                if (!useExponentialBackOff) {
> 		                    event.reconnectDelay = initialReconnectDelay;
> 		                } else {
> 		                    // Exponential increment of reconnect delay.
> 		                    event.reconnectDelay*=backOffMultiplier;
> 		                    if(event.reconnectDelay>maxReconnectDelay)
> 		                        event.reconnectDelay=maxReconnectDelay;
> 		                }
> 		                
> 	    			} else {
> 	    				event.connectFailures = 0;
> 	                    event.reconnectDelay = initialReconnectDelay;
> 	    			}
> 	    			                    			
> 	            	if( !running.get() )
> 	            		return;
> 	            	
> 	    			event.connectTime = System.currentTimeMillis();
> 	    			event.failed.set(false);
> 	    			
> 	    			listener.onServiceAdd(event);
> 	    		}
> 	    	};
> 	    	thread.setDaemon(true);
> 	    	thread.start();
>         }
>     }
> {code}
> Prior to calling DiscoveryNetworkConnector.onServiceAdd, event.failed is set to false (T1), and it's possible for some other thread (T2) to enter block guarded by if( event.failed.compareAndSet(false, true) ) , while reconnect process has already begun by first thread. T2 can satisfy condition: if( event.connectTime + minConnectTime > System.currentTimeMillis()  )  and will enter 
> sleepMutex.wait(event.reconnectDelay), but still holding event.failed == true (causing all other calls to serviceFailed not to start thread that will reconnect bridge).
> If first thread (T1) fails to reconnect bridge (e.g because of InvalidClientIDException described in step 4), it will not schedule new thread to restart broker (and call DiscoveryNetworkConnector.onServiceRemove, and cleanup DiscoveryNetworkConnector.bridges) because of event.failed == true, and T2 still waiting (default 5 sec). When T2 wakes up from wait, it will try to restart broker and fail because of following condition in DiscoveryNetworkConnector:
> {code}
>             if (    bridges.containsKey(uri) 
>                     || localURI.equals(uri) 
>                     || (connectionFilter!=null && !connectionFilter.connectTo(uri))
>                     )
>                 return;
> {code}
> bridges.containsKey(uri) will be true (thread T1 added it while unsuccessfully trying to reconnect bridge), and T2 will return from DiscoveryNetworkConnector.onServiceAdd and will not start bridge. 
> No additional attempt to reconnect bridge will be made, since T2 held event.failed == true, effectively ignoring SimpleDiscoveryAgent.serviceFailed calls from other threads processing local or remote bridge exceptions.
> End result:
> - DiscoveryNetworkConnector.bridges contains bridge that is disposed and prevents all other attempts to restart bridge (onServiceAdd always returns because bridges.containsKey(uri) == true) 
> - SimpleDiscoveryAgent doesn't try to reconnect the bridge (T2 was a last attempt which returned without restarting the bridge - SimpleDiscoveryAgent.serviceFailed is not called again, since bridge is not started
> I think that synchronization of threads processing bridge exceptions and entering SimpleDiscoveryAgent.serviceFailed should be verified and/or improved.
> Also, InvalidClientIDException is relatively common (at least on multicore machines, e.g. Solaris T2000), maybe ConduitBridge.serviceLocalException (which starts another thread doing ServiceSupport.dispose(DemandForwardingBridgeSupport.this)), should be changed to wait a bit for bridge disposal to finish (e.g. sleep for some time) and then try to restart a bridge - waiting for a second more to restart a bridge is better then not to start it at all
> I've seen this problem in 4.1.0 and 4.1.2, but I think it can occur in 5.1 and 5.2 trunk (SimpleDiscoveryAgent.serviceFailed and DiscoveryNetworkConnector.onServiceAdd are more or less the same, just using ASYNC_TASKS to execute asynchronous calls, instead of starting new threads directly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (AMQ-1855) bridge reconnection stops because of race in SimpleDiscoveryAgent

Posted by "Gary Tully (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/activemq/browse/AMQ-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Tully reassigned AMQ-1855:
-------------------------------

    Assignee: Gary Tully  (was: Rob Davies)

> bridge reconnection stops because of race in SimpleDiscoveryAgent
> -----------------------------------------------------------------
>
>                 Key: AMQ-1855
>                 URL: https://issues.apache.org/activemq/browse/AMQ-1855
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: Connector
>    Affects Versions: 4.1.2
>            Reporter: Mario Lukica
>            Assignee: Gary Tully
>             Fix For: 5.4.0
>
>
> I believe there is a race condition in SimpleDiscoveryAgent which can cause subsequent bridge restart to fail, without starting new thread that should restart a bridge. As a consequence, network bridge is never restarted.
> Following scenario leads to this:
> 1. bridge is disconnected (e.g. local error: org.apache.activemq.transport.InactivityIOException: Channel was inactive for too long)
> 2. bridge is disposed in separate thread in DemandForwardingBridge.serviceLocalException
> 3. SimpleDiscoveryAgent.serviceFailed is called which starts up another thread which calls DiscoveryNetworkConnector.onServiceAdd which tries to restart bridge
> 4. bridge startup can cause javax.jms.InvalidClientIDException: Broker: some_broker2 - Client: NC_some_broker1_inboundlocalhost already connected (this one is caused by race condition with thread disposing the bridge, since given client subscription should be removed by thread disposing the bridge (step 2)
> 5. this causes invocation of DemandForwardingBridge.serviceLocalException (this call can be made asynchronously, while previous bridge startup is still in progress)
> As a consequence, multiple threads can end up calling SimpleDiscoveryAgent.serviceFailed simultaneously.
> serviceFailed will call DiscoveryNetworkConnector.onServiceAdd which will try to reconnect bridge. Reconnect logic is guarded by 
> if( event.failed.compareAndSet(false, true) ) 
> which tries to ensure that only a single thread is reconnecting bridge at some point.
> {code}
>     public void serviceFailed(DiscoveryEvent devent) throws IOException {
>     	
>         final SimpleDiscoveryEvent event = (SimpleDiscoveryEvent) devent;
>         if( event.failed.compareAndSet(false, true) ) {
>         	
> 			listener.onServiceRemove(event);
> 	    	Thread thread = new Thread() {
> 	    		public void run() {
> 	
> 	
> 	    			// We detect a failed connection attempt because the service fails right
> 	    			// away.
> 	    			if( event.connectTime + minConnectTime > System.currentTimeMillis()  ) {
> 	    				
> 	    				event.connectFailures++;
> 	    				
> 	    				if( maxReconnectAttempts>0 &&  event.connectFailures >= maxReconnectAttempts ) {
> 	    					// Don' try to re-connect
> 	    					return;
> 	    				}
> 	    				
> 		                synchronized(sleepMutex){
> 		                    try{
> 		                    	if( !running.get() )
> 		                    		return;
> 		                    	
> 		                        sleepMutex.wait(event.reconnectDelay);
> 		                    }catch(InterruptedException ie){
>                                 Thread.currentThread().interrupt();
> 		                       return;
> 		                    }
> 		                }
> 	
> 		                if (!useExponentialBackOff) {
> 		                    event.reconnectDelay = initialReconnectDelay;
> 		                } else {
> 		                    // Exponential increment of reconnect delay.
> 		                    event.reconnectDelay*=backOffMultiplier;
> 		                    if(event.reconnectDelay>maxReconnectDelay)
> 		                        event.reconnectDelay=maxReconnectDelay;
> 		                }
> 		                
> 	    			} else {
> 	    				event.connectFailures = 0;
> 	                    event.reconnectDelay = initialReconnectDelay;
> 	    			}
> 	    			                    			
> 	            	if( !running.get() )
> 	            		return;
> 	            	
> 	    			event.connectTime = System.currentTimeMillis();
> 	    			event.failed.set(false);
> 	    			
> 	    			listener.onServiceAdd(event);
> 	    		}
> 	    	};
> 	    	thread.setDaemon(true);
> 	    	thread.start();
>         }
>     }
> {code}
> Prior to calling DiscoveryNetworkConnector.onServiceAdd, event.failed is set to false (T1), and it's possible for some other thread (T2) to enter block guarded by if( event.failed.compareAndSet(false, true) ) , while reconnect process has already begun by first thread. T2 can satisfy condition: if( event.connectTime + minConnectTime > System.currentTimeMillis()  )  and will enter 
> sleepMutex.wait(event.reconnectDelay), but still holding event.failed == true (causing all other calls to serviceFailed not to start thread that will reconnect bridge).
> If first thread (T1) fails to reconnect bridge (e.g because of InvalidClientIDException described in step 4), it will not schedule new thread to restart broker (and call DiscoveryNetworkConnector.onServiceRemove, and cleanup DiscoveryNetworkConnector.bridges) because of event.failed == true, and T2 still waiting (default 5 sec). When T2 wakes up from wait, it will try to restart broker and fail because of following condition in DiscoveryNetworkConnector:
> {code}
>             if (    bridges.containsKey(uri) 
>                     || localURI.equals(uri) 
>                     || (connectionFilter!=null && !connectionFilter.connectTo(uri))
>                     )
>                 return;
> {code}
> bridges.containsKey(uri) will be true (thread T1 added it while unsuccessfully trying to reconnect bridge), and T2 will return from DiscoveryNetworkConnector.onServiceAdd and will not start bridge. 
> No additional attempt to reconnect bridge will be made, since T2 held event.failed == true, effectively ignoring SimpleDiscoveryAgent.serviceFailed calls from other threads processing local or remote bridge exceptions.
> End result:
> - DiscoveryNetworkConnector.bridges contains bridge that is disposed and prevents all other attempts to restart bridge (onServiceAdd always returns because bridges.containsKey(uri) == true) 
> - SimpleDiscoveryAgent doesn't try to reconnect the bridge (T2 was a last attempt which returned without restarting the bridge - SimpleDiscoveryAgent.serviceFailed is not called again, since bridge is not started
> I think that synchronization of threads processing bridge exceptions and entering SimpleDiscoveryAgent.serviceFailed should be verified and/or improved.
> Also, InvalidClientIDException is relatively common (at least on multicore machines, e.g. Solaris T2000), maybe ConduitBridge.serviceLocalException (which starts another thread doing ServiceSupport.dispose(DemandForwardingBridgeSupport.this)), should be changed to wait a bit for bridge disposal to finish (e.g. sleep for some time) and then try to restart a bridge - waiting for a second more to restart a bridge is better then not to start it at all
> I've seen this problem in 4.1.0 and 4.1.2, but I think it can occur in 5.1 and 5.2 trunk (SimpleDiscoveryAgent.serviceFailed and DiscoveryNetworkConnector.onServiceAdd are more or less the same, just using ASYNC_TASKS to execute asynchronous calls, instead of starting new threads directly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (AMQ-1855) bridge reconnection stops because of race in SimpleDiscoveryAgent

Posted by "Gary Tully (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/activemq/browse/AMQ-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=53828#action_53828 ] 

Gary Tully commented on AMQ-1855:
---------------------------------

the enhancements in http://svn.apache.org/viewvc?rev=805361&view=rev can help configure a faster timeout on a discovered network connector. 

> bridge reconnection stops because of race in SimpleDiscoveryAgent
> -----------------------------------------------------------------
>
>                 Key: AMQ-1855
>                 URL: https://issues.apache.org/activemq/browse/AMQ-1855
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: Connector
>    Affects Versions: 4.1.2
>            Reporter: Mario Lukica
>            Assignee: Gary Tully
>             Fix For: 5.3.0
>
>
> I believe there is a race condition in SimpleDiscoveryAgent which can cause subsequent bridge restart to fail, without starting new thread that should restart a bridge. As a consequence, network bridge is never restarted.
> Following scenario leads to this:
> 1. bridge is disconnected (e.g. local error: org.apache.activemq.transport.InactivityIOException: Channel was inactive for too long)
> 2. bridge is disposed in separate thread in DemandForwardingBridge.serviceLocalException
> 3. SimpleDiscoveryAgent.serviceFailed is called which starts up another thread which calls DiscoveryNetworkConnector.onServiceAdd which tries to restart bridge
> 4. bridge startup can cause javax.jms.InvalidClientIDException: Broker: some_broker2 - Client: NC_some_broker1_inboundlocalhost already connected (this one is caused by race condition with thread disposing the bridge, since given client subscription should be removed by thread disposing the bridge (step 2)
> 5. this causes invocation of DemandForwardingBridge.serviceLocalException (this call can be made asynchronously, while previous bridge startup is still in progress)
> As a consequence, multiple threads can end up calling SimpleDiscoveryAgent.serviceFailed simultaneously.
> serviceFailed will call DiscoveryNetworkConnector.onServiceAdd which will try to reconnect bridge. Reconnect logic is guarded by 
> if( event.failed.compareAndSet(false, true) ) 
> which tries to ensure that only a single thread is reconnecting bridge at some point.
> {code}
>     public void serviceFailed(DiscoveryEvent devent) throws IOException {
>     	
>         final SimpleDiscoveryEvent event = (SimpleDiscoveryEvent) devent;
>         if( event.failed.compareAndSet(false, true) ) {
>         	
> 			listener.onServiceRemove(event);
> 	    	Thread thread = new Thread() {
> 	    		public void run() {
> 	
> 	
> 	    			// We detect a failed connection attempt because the service fails right
> 	    			// away.
> 	    			if( event.connectTime + minConnectTime > System.currentTimeMillis()  ) {
> 	    				
> 	    				event.connectFailures++;
> 	    				
> 	    				if( maxReconnectAttempts>0 &&  event.connectFailures >= maxReconnectAttempts ) {
> 	    					// Don' try to re-connect
> 	    					return;
> 	    				}
> 	    				
> 		                synchronized(sleepMutex){
> 		                    try{
> 		                    	if( !running.get() )
> 		                    		return;
> 		                    	
> 		                        sleepMutex.wait(event.reconnectDelay);
> 		                    }catch(InterruptedException ie){
>                                 Thread.currentThread().interrupt();
> 		                       return;
> 		                    }
> 		                }
> 	
> 		                if (!useExponentialBackOff) {
> 		                    event.reconnectDelay = initialReconnectDelay;
> 		                } else {
> 		                    // Exponential increment of reconnect delay.
> 		                    event.reconnectDelay*=backOffMultiplier;
> 		                    if(event.reconnectDelay>maxReconnectDelay)
> 		                        event.reconnectDelay=maxReconnectDelay;
> 		                }
> 		                
> 	    			} else {
> 	    				event.connectFailures = 0;
> 	                    event.reconnectDelay = initialReconnectDelay;
> 	    			}
> 	    			                    			
> 	            	if( !running.get() )
> 	            		return;
> 	            	
> 	    			event.connectTime = System.currentTimeMillis();
> 	    			event.failed.set(false);
> 	    			
> 	    			listener.onServiceAdd(event);
> 	    		}
> 	    	};
> 	    	thread.setDaemon(true);
> 	    	thread.start();
>         }
>     }
> {code}
> Prior to calling DiscoveryNetworkConnector.onServiceAdd, event.failed is set to false (T1), and it's possible for some other thread (T2) to enter block guarded by if( event.failed.compareAndSet(false, true) ) , while reconnect process has already begun by first thread. T2 can satisfy condition: if( event.connectTime + minConnectTime > System.currentTimeMillis()  )  and will enter 
> sleepMutex.wait(event.reconnectDelay), but still holding event.failed == true (causing all other calls to serviceFailed not to start thread that will reconnect bridge).
> If first thread (T1) fails to reconnect bridge (e.g because of InvalidClientIDException described in step 4), it will not schedule new thread to restart broker (and call DiscoveryNetworkConnector.onServiceRemove, and cleanup DiscoveryNetworkConnector.bridges) because of event.failed == true, and T2 still waiting (default 5 sec). When T2 wakes up from wait, it will try to restart broker and fail because of following condition in DiscoveryNetworkConnector:
> {code}
>             if (    bridges.containsKey(uri) 
>                     || localURI.equals(uri) 
>                     || (connectionFilter!=null && !connectionFilter.connectTo(uri))
>                     )
>                 return;
> {code}
> bridges.containsKey(uri) will be true (thread T1 added it while unsuccessfully trying to reconnect bridge), and T2 will return from DiscoveryNetworkConnector.onServiceAdd and will not start bridge. 
> No additional attempt to reconnect bridge will be made, since T2 held event.failed == true, effectively ignoring SimpleDiscoveryAgent.serviceFailed calls from other threads processing local or remote bridge exceptions.
> End result:
> - DiscoveryNetworkConnector.bridges contains bridge that is disposed and prevents all other attempts to restart bridge (onServiceAdd always returns because bridges.containsKey(uri) == true) 
> - SimpleDiscoveryAgent doesn't try to reconnect the bridge (T2 was a last attempt which returned without restarting the bridge - SimpleDiscoveryAgent.serviceFailed is not called again, since bridge is not started
> I think that synchronization of threads processing bridge exceptions and entering SimpleDiscoveryAgent.serviceFailed should be verified and/or improved.
> Also, InvalidClientIDException is relatively common (at least on multicore machines, e.g. Solaris T2000), maybe ConduitBridge.serviceLocalException (which starts another thread doing ServiceSupport.dispose(DemandForwardingBridgeSupport.this)), should be changed to wait a bit for bridge disposal to finish (e.g. sleep for some time) and then try to restart a bridge - waiting for a second more to restart a bridge is better then not to start it at all
> I've seen this problem in 4.1.0 and 4.1.2, but I think it can occur in 5.1 and 5.2 trunk (SimpleDiscoveryAgent.serviceFailed and DiscoveryNetworkConnector.onServiceAdd are more or less the same, just using ASYNC_TASKS to execute asynchronous calls, instead of starting new threads directly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (AMQ-1855) bridge reconnection stops because of race in SimpleDiscoveryAgent

Posted by "Gary Tully (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/activemq/browse/AMQ-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Tully updated AMQ-1855:
----------------------------

    Fix Version/s:     (was: 5.4.0)
                   5.3.0

> bridge reconnection stops because of race in SimpleDiscoveryAgent
> -----------------------------------------------------------------
>
>                 Key: AMQ-1855
>                 URL: https://issues.apache.org/activemq/browse/AMQ-1855
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: Connector
>    Affects Versions: 4.1.2
>            Reporter: Mario Lukica
>            Assignee: Gary Tully
>             Fix For: 5.3.0
>
>
> I believe there is a race condition in SimpleDiscoveryAgent which can cause subsequent bridge restart to fail, without starting new thread that should restart a bridge. As a consequence, network bridge is never restarted.
> Following scenario leads to this:
> 1. bridge is disconnected (e.g. local error: org.apache.activemq.transport.InactivityIOException: Channel was inactive for too long)
> 2. bridge is disposed in separate thread in DemandForwardingBridge.serviceLocalException
> 3. SimpleDiscoveryAgent.serviceFailed is called which starts up another thread which calls DiscoveryNetworkConnector.onServiceAdd which tries to restart bridge
> 4. bridge startup can cause javax.jms.InvalidClientIDException: Broker: some_broker2 - Client: NC_some_broker1_inboundlocalhost already connected (this one is caused by race condition with thread disposing the bridge, since given client subscription should be removed by thread disposing the bridge (step 2)
> 5. this causes invocation of DemandForwardingBridge.serviceLocalException (this call can be made asynchronously, while previous bridge startup is still in progress)
> As a consequence, multiple threads can end up calling SimpleDiscoveryAgent.serviceFailed simultaneously.
> serviceFailed will call DiscoveryNetworkConnector.onServiceAdd which will try to reconnect bridge. Reconnect logic is guarded by 
> if( event.failed.compareAndSet(false, true) ) 
> which tries to ensure that only a single thread is reconnecting bridge at some point.
> {code}
>     public void serviceFailed(DiscoveryEvent devent) throws IOException {
>     	
>         final SimpleDiscoveryEvent event = (SimpleDiscoveryEvent) devent;
>         if( event.failed.compareAndSet(false, true) ) {
>         	
> 			listener.onServiceRemove(event);
> 	    	Thread thread = new Thread() {
> 	    		public void run() {
> 	
> 	
> 	    			// We detect a failed connection attempt because the service fails right
> 	    			// away.
> 	    			if( event.connectTime + minConnectTime > System.currentTimeMillis()  ) {
> 	    				
> 	    				event.connectFailures++;
> 	    				
> 	    				if( maxReconnectAttempts>0 &&  event.connectFailures >= maxReconnectAttempts ) {
> 	    					// Don' try to re-connect
> 	    					return;
> 	    				}
> 	    				
> 		                synchronized(sleepMutex){
> 		                    try{
> 		                    	if( !running.get() )
> 		                    		return;
> 		                    	
> 		                        sleepMutex.wait(event.reconnectDelay);
> 		                    }catch(InterruptedException ie){
>                                 Thread.currentThread().interrupt();
> 		                       return;
> 		                    }
> 		                }
> 	
> 		                if (!useExponentialBackOff) {
> 		                    event.reconnectDelay = initialReconnectDelay;
> 		                } else {
> 		                    // Exponential increment of reconnect delay.
> 		                    event.reconnectDelay*=backOffMultiplier;
> 		                    if(event.reconnectDelay>maxReconnectDelay)
> 		                        event.reconnectDelay=maxReconnectDelay;
> 		                }
> 		                
> 	    			} else {
> 	    				event.connectFailures = 0;
> 	                    event.reconnectDelay = initialReconnectDelay;
> 	    			}
> 	    			                    			
> 	            	if( !running.get() )
> 	            		return;
> 	            	
> 	    			event.connectTime = System.currentTimeMillis();
> 	    			event.failed.set(false);
> 	    			
> 	    			listener.onServiceAdd(event);
> 	    		}
> 	    	};
> 	    	thread.setDaemon(true);
> 	    	thread.start();
>         }
>     }
> {code}
> Prior to calling DiscoveryNetworkConnector.onServiceAdd, event.failed is set to false (T1), and it's possible for some other thread (T2) to enter block guarded by if( event.failed.compareAndSet(false, true) ) , while reconnect process has already begun by first thread. T2 can satisfy condition: if( event.connectTime + minConnectTime > System.currentTimeMillis()  )  and will enter 
> sleepMutex.wait(event.reconnectDelay), but still holding event.failed == true (causing all other calls to serviceFailed not to start thread that will reconnect bridge).
> If first thread (T1) fails to reconnect bridge (e.g because of InvalidClientIDException described in step 4), it will not schedule new thread to restart broker (and call DiscoveryNetworkConnector.onServiceRemove, and cleanup DiscoveryNetworkConnector.bridges) because of event.failed == true, and T2 still waiting (default 5 sec). When T2 wakes up from wait, it will try to restart broker and fail because of following condition in DiscoveryNetworkConnector:
> {code}
>             if (    bridges.containsKey(uri) 
>                     || localURI.equals(uri) 
>                     || (connectionFilter!=null && !connectionFilter.connectTo(uri))
>                     )
>                 return;
> {code}
> bridges.containsKey(uri) will be true (thread T1 added it while unsuccessfully trying to reconnect bridge), and T2 will return from DiscoveryNetworkConnector.onServiceAdd and will not start bridge. 
> No additional attempt to reconnect bridge will be made, since T2 held event.failed == true, effectively ignoring SimpleDiscoveryAgent.serviceFailed calls from other threads processing local or remote bridge exceptions.
> End result:
> - DiscoveryNetworkConnector.bridges contains bridge that is disposed and prevents all other attempts to restart bridge (onServiceAdd always returns because bridges.containsKey(uri) == true) 
> - SimpleDiscoveryAgent doesn't try to reconnect the bridge (T2 was a last attempt which returned without restarting the bridge - SimpleDiscoveryAgent.serviceFailed is not called again, since bridge is not started
> I think that synchronization of threads processing bridge exceptions and entering SimpleDiscoveryAgent.serviceFailed should be verified and/or improved.
> Also, InvalidClientIDException is relatively common (at least on multicore machines, e.g. Solaris T2000), maybe ConduitBridge.serviceLocalException (which starts another thread doing ServiceSupport.dispose(DemandForwardingBridgeSupport.this)), should be changed to wait a bit for bridge disposal to finish (e.g. sleep for some time) and then try to restart a bridge - waiting for a second more to restart a bridge is better then not to start it at all
> I've seen this problem in 4.1.0 and 4.1.2, but I think it can occur in 5.1 and 5.2 trunk (SimpleDiscoveryAgent.serviceFailed and DiscoveryNetworkConnector.onServiceAdd are more or less the same, just using ASYNC_TASKS to execute asynchronous calls, instead of starting new threads directly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.