Posted to dev@activemq.apache.org by "Jeremy Lucier (JIRA)" <ji...@apache.org> on 2013/12/24 15:18:50 UTC

[jira] [Updated] (AMQ-4954) PooledConnectionFactory Race Condition & Failover Problems

     [ https://issues.apache.org/jira/browse/AMQ-4954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremy Lucier updated AMQ-4954:
-------------------------------

    Description: 
[Sorry for this long bug report]

We're currently using ActiveMQ 5.9 via a Camel implementation for consuming/polling messages off of queues as well as sending messages.  With the latest version of ActiveMQ on both the client and server side (5.9), we've discovered numerous problems with the changes to PooledConnectionFactory.  These problems may have existed before, but it wasn't until recently that we were able to do any basic load testing on our system.
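
For reference, here's a minimal sketch of the kind of setup in play (simplified; the broker URL, queue names, and pool size below are placeholders, not our actual configuration):

{code:java}
import org.apache.activemq.camel.component.ActiveMQComponent;
import org.apache.activemq.pool.PooledConnectionFactory;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

// Minimal sketch of the setup: Camel routes consuming from and producing to
// ActiveMQ queues through a PooledConnectionFactory over a failover transport.
public class PooledCamelSetup {
    public static void main(String[] args) throws Exception {
        PooledConnectionFactory pool = new PooledConnectionFactory(
                "failover:(tcp://localhost:61616)"); // placeholder broker URL
        pool.setMaxConnections(1); // a small pool makes the issues below easy to hit

        ActiveMQComponent activemq = new ActiveMQComponent();
        activemq.setConnectionFactory(pool);

        DefaultCamelContext context = new DefaultCamelContext();
        context.addComponent("activemq", activemq);
        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                // one of several routes polling a queue and forwarding messages
                from("activemq:queue:inbound.test")
                        .to("activemq:queue:outbound.test");
            }
        });
        context.start();              // if the broker is down at this point, every route
        Thread.sleep(Long.MAX_VALUE); // starts fighting over the single pooled connection
    }
}
{code}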

[1] The first one is a race condition, which is actually documented within the code itself via a comment (the createConnection method within PooledConnectionFactory).  If we start our Camel-based application while the broker is down and then bring the broker up, it takes an indeterminate amount of time to reconnect (typically a few hours of thrashing).  The reason is that on startup or failover, all the routes/threads attempt to get a connection and establish a new session via PooledConnectionFactory in unison, over and over again.  That's fine and expected behavior, but it leads to the next problem, which contributes to the race condition.
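
To make the shape of the problem concrete, here's a simplified paraphrase of that loop (this is NOT the actual 5.9 source, just a sketch of the pattern with stand-in types):

{code:java}
import javax.jms.Connection;
import javax.jms.JMSException;

// Simplified paraphrase (not the real createConnection) of the shape that causes
// the thrashing: every caller serializes on the same lock and loops until it gets
// a pool entry that does not report itself as expired.
public class CreateConnectionSketch {

    // stand-in for the real ConnectionPool
    interface PoolEntry {
        boolean expiredCheck();           // true only once nobody references it
        Connection newPooledConnection(); // wraps the shared connection
    }

    interface EntryCache {
        PoolEntry borrow() throws JMSException; // may create a new broker connection
    }

    private final EntryCache cache;

    CreateConnectionSketch(EntryCache cache) {
        this.cache = cache;
    }

    public synchronized Connection createConnection() throws JMSException {
        PoolEntry entry = null;
        while (entry == null) {
            entry = cache.borrow();       // with the broker down, this throws or
                                          // hands back an already-expiring entry
            if (entry.expiredCheck()) {   // expired entries are discarded and the
                entry = null;             // loop tries again, so under contention
            }                             // callers can spin here for a long time
        }
        return entry.newPooledConnection();
    }
}
{code}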

[2] The PooledConnectionFactory continues to return expired/expiring ConnectionPools, and the calling threads fight to increment/decrement the internal reference counter within each of those objects.  Eventually they might win and hit 0 so that expiredCheck returns true, but again it's random and usually takes a long time.  This is very easy to reproduce if you only have one connection in your pool.
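
A simplified sketch of that interplay (again, stand-in code rather than the real ConnectionPool) looks roughly like this:

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

// Simplified sketch of the reference-count fight described above: expiredCheck()
// only reports an unusable connection as expired once every borrower has
// decremented its reference, so while many threads keep incrementing, a dead
// connection can stay "alive" in the pool for a very long time.
public class RefCountedEntrySketch {

    private final AtomicInteger references = new AtomicInteger();
    private volatile boolean hasExpired;    // set when the underlying connection dies

    public void incrementReferenceCount() {
        references.incrementAndGet();       // every borrower bumps this on checkout
    }

    public void decrementReferenceCount() {
        references.decrementAndGet();       // ...and drops it again when done
    }

    public void onConnectionError() {
        hasExpired = true;                  // a transport/failover error marks the entry
    }

    public boolean expiredCheck() {
        // The entry is only actually discarded when it is both expired AND unreferenced.
        // With threads constantly checking out the single entry, the count rarely
        // reaches zero, so the pool keeps handing out the same dead connection.
        return hasExpired && references.get() == 0;
    }
}
{code}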

The fix I originally implemented in my own code was to add an "isExpiring" flag to ConnectionPool and keep throwing the bad connection back into the pool, in the hope of it being invalidated as the while loop grabbed another connection (to stay as close to the original behavior as possible).  That fixed the startup-with-the-broker-down problem, but it introduced another problem if failover occurs after good connections have been established.  Basically, I learned that the PooledConnectionFactory never validates/starts/establishes a working connection (it has hooks for a connection error to set it as expiring, but the start-up and validation happen later in the workflow).  That led to the DefaultJmsMessageContainer continually flagging the connection as bad and continually trying to refresh it, which basically ends up with the system creating a single connection with one session, using it, and then terminating it, over and over again.  Not ideal since we're looking to pool.
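
Roughly, the flag-based workaround looked like this (simplified and with stand-in types, not the exact patch):

{code:java}
import javax.jms.Connection;
import javax.jms.JMSException;

// Simplified sketch of the "isExpiring" workaround described above: an entry is
// flagged the moment it starts expiring, and createConnection() skips flagged
// entries instead of waiting for the reference count to reach zero.
public class IsExpiringSketch {

    interface PoolEntry {
        boolean isExpiring();              // new flag: set on the first sign of expiry
        Connection newPooledConnection();
    }

    interface EntryCache {
        PoolEntry borrow() throws JMSException;
        void giveBack(PoolEntry entry);    // returned entries get invalidated later
    }

    private final EntryCache cache;

    IsExpiringSketch(EntryCache cache) {
        this.cache = cache;
    }

    public synchronized Connection createConnection() throws JMSException {
        while (true) {
            PoolEntry entry = cache.borrow();
            if (entry.isExpiring()) {
                cache.giveBack(entry);     // throw the bad entry back for invalidation
                continue;                  // ...and try to grab a different one
            }
            return entry.newPooledConnection();
        }
    }
}
{code}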

So I added validation to PooledConnectionFactory's cache itself to start the connection when it's a newly created ConnectionPool, to ensure it works.  That worked great, except that createConnection is synchronized and has a while loop which, under certain failover conditions, causes createConnection to block all connection threads indefinitely.  This is because there might be a thread that incremented one of the ConnectionPools in the cache and is now blocked waiting on a new createConnection call (since another thread is in the while loop waiting on a new non-expiring connection), and thus the decrement never gets issued.  Introducing a max limit to the loop addresses that (or a similar base case), but it's not ideal either, since you'd assume everything in the cache is valid.
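
Sketched out, those two changes (start/validate new entries, bound the loop) look something like this (again simplified with stand-in types, not the exact patch; the MAX_ATTEMPTS value is arbitrary):

{code:java}
import javax.jms.Connection;
import javax.jms.JMSException;

// Simplified sketch: newly created entries are started/validated before being
// handed out, and the retry loop has an explicit upper bound so createConnection()
// cannot block every caller forever when a referenced entry never expires.
public class BoundedCreateConnectionSketch {

    interface PoolEntry {
        boolean isNewlyCreated();
        void start() throws JMSException;  // actually starts the underlying connection
        boolean expiredCheck();
        Connection newPooledConnection();
    }

    interface EntryCache {
        PoolEntry borrow() throws JMSException;
    }

    private static final int MAX_ATTEMPTS = 10; // arbitrary base case for the sketch

    private final EntryCache cache;

    BoundedCreateConnectionSketch(EntryCache cache) {
        this.cache = cache;
    }

    public synchronized Connection createConnection() throws JMSException {
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            PoolEntry entry = cache.borrow();
            if (entry.isNewlyCreated()) {
                entry.start();              // validate that the connection actually
            }                               // works before the pool hands it out
            if (!entry.expiredCheck()) {
                return entry.newPooledConnection();
            }
        }
        throw new JMSException("Could not obtain a usable pooled connection after "
                + MAX_ATTEMPTS + " attempts");
    }
}
{code}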

In general, what started as a short "look around" to fix a race condition ended with me redoing quite a bit of functionality and redesigning how a "close" and cache destroy occur on a ConnectionPool.  I think a lot of this was done initially in an attempt to maximize some of the features of the cache implementation that's in place.  I can pull my code modifications down from my system and attach them to this ticket if you're interested; otherwise I just wanted to bring some of these issues to your attention.

Hopefully this made sense; otherwise I can clarify further.


> PooledConnectionFactory Race Condition & Failover Problems
> ----------------------------------------------------------
>
>                 Key: AMQ-4954
>                 URL: https://issues.apache.org/jira/browse/AMQ-4954
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: JMS client
>    Affects Versions: 5.9.0
>         Environment: Linux 64bit, 8-core, 16gb RAM.
>            Reporter: Jeremy Lucier
>



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)