Posted to users@tomcat.apache.org by Christopher Schultz <ch...@christopherschultz.net> on 2021/10/14 16:18:56 UTC

Re: Tomcat 9.0.x JDBC connection pool does not always remove abandoned connections

Gerhardt,

On 10/12/21 13:27, Martin, Gerhardt A wrote:
> Running Tomcat 9.0.50 on CentOS 7.9.x Linux and using the Tomcat JDBC connection pool to connect to my application's databases. My app connects to about a dozen read-only databases and one read/write database. Here is a typical resource definition with tuning configurations for the pool and the MSSQL Server JDBC driver:
> 
> <Resource

Note that these are the only relevant attributes:

 > factory="org.apache.tomcat.jdbc.pool.DataSourceFactory"

You are using tomcat-pool, not the default commons-dbcp.

> maxActive="20"
> initialSize="1"
> maxIdle="8"
> minIdle="1"
> maxWait="5000"
> testOnBorrow="false" 
> testWhileIdle="false" 
> testOnConnect="false" 
> testOnReturn="false" 
> timeBetweenEvictionRunsMillis="10000"
> removeAbandonedTimeout="15"
> removeAbandoned="true"
> logAbandoned="false"
> minEvictableIdleTimeMillis="900000"
> maxAge="600000"



> Important NOTE: the framework we employ does NOT execute more than one SQL statement per borrow from the connection pool. It's a constant cycle of getConnection (borrow) -> execute one SQL statement -> closeConnection (return).

That's kind of wasteful, but okay.
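
For reference, that per-statement cycle in plain JDBC terms looks roughly
like this (a sketch with illustrative names, not your framework's actual
code):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import javax.sql.DataSource;

    // One borrow -> one statement -> one return, per the cycle above.
    void runOne(DataSource ds, String sql) throws Exception {
        try (Connection conn = ds.getConnection();              // borrow
             PreparedStatement ps = conn.prepareStatement(sql)) {
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) { /* consume the single result */ }
            }
        } // conn.close() returns the connection to the pool
    }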

> Why are we configured this way:
> Ultimately we need to fail fast to avoid our APP going into death by stuck threads when we lose connectivity to the backend databases for a short period of time (for example, a short 10-second period where the DB(s) is/are unreachable from the app servers).

Is this common?

> -We want the evictor thread (Pool Cleaner) to run every 10 seconds
> -We are only using the Pool Cleaner to clobber busy connections that have been running 15 seconds or more
>    NOTE: we rarely see SQL that takes longer than 2 seconds; most are sub-100ms and the average is about 4ms

Note that the connections aren't clobbered; they are simply dropped from 
the pool. The work they generated is still continuing...

> -We are not testing connections, period
> -We want our connections replaced with fresh connections after 10 minutes of use, hence we set maxAge=600000
> The connection pool settings work hand in hand with properties set on the SQL Server JDBC driver in the URL:
> The driver will kill any SQL that takes longer than 10 seconds

Ouch.
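
For anyone following along: with the Microsoft JDBC driver, those timeouts
are typically set as properties on the connection URL. A sketch (host and
database names are placeholders; per the mssql-jdbc docs, queryTimeout and
cancelQueryTimeout are in seconds while socketTimeout is in milliseconds):

    jdbc:sqlserver://dbhost:1433;databaseName=appdb;queryTimeout=10;cancelQueryTimeout=10;socketTimeout=300000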

> -When everything else is OK, the connection itself is returned intact back to the connection pool
> -When the DB is unreachable, there is another timer, cancelQueryTimeout, set to 10 seconds, which is a safety valve for cases where the message to the DB server to cancel the SQL gracefully does not make it to the DB. After the
> cancelQueryTimeout, the client side of the driver gives up. This is problematic because it actually returns a bad connection back to the connection pool.
> 
> The combination of the connection pool settings to run the Pool Cleaner every 10 seconds and removeAbandonedTimeout of 15 seconds should work together to remove any connection whose SQL execution time went past 10 seconds before the cancelQueryTimeout kicks in. This should prevent returning bad connections to the pool when bad things happen, like a network switch going offline temporarily and dropping all packets between the app server hosting Tomcat and the database server.

Sounds like you really just want to be validating connections, but you 
have explicitly disabled that feature. Why?
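
If you did want validation, the relevant tomcat-pool attributes would look
something like this (a sketch only, not a recommendation for your load
profile; validationInterval is in milliseconds):

    testOnBorrow="true"
    validationQuery="SELECT 1"
    validationQueryTimeout="3"
    validationInterval="3000"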

> NOTE that for the MSSQL Server driver, all of its timeouts default to waiting indefinitely. This is one of the prime reasons why even short outages can kill our APP with stuck threads. Our Web app hits Tomcat at about 2000 requests per minute, and each request fires on average ~6 SQL statements, so we are running about 12K SQL hits per minute. If we lose connectivity to a database for even 15 seconds without a means to fail fast, we could easily exhaust the allotted number of threads, which will create a "log jam" that can take a very long time to recover from.

Why doesn't your query-timeout handle this situation?

> So I am very sorry for all this lengthy background, but I thought it might help readers and potential responders to know a bit about why we do what we do in our connection pool.
> 
> Finally now onto the problem:
> 
> I was running a load test and simulating the temporary loss of connectivity to one or more of the databases used in the APP, and I noticed a strange behavior that does not align with my understanding of how Tomcat JDBC removeAbandoned works. While most of my database connections coming from the pool did die shortly after the simulation started and "failed fast", there were a few connections that actually ran for up to 5 minutes before they terminated themselves. Given the configuration I explained, I would have expected no database connection that was interacting with the database to live beyond 30 seconds (given that there could be some forward/reverse lag in the evictor thread timing, there is a slight chance that the first run of the evictor thread might not catch every connection that was at the 15-second mark).

Remember that the pool won't kill the connection or anything else. It 
just lets the connection "leak" out of the pool and ... hopefully 
eventually be garbage-collected. But if the thread running that query is 
still running, it will continue to do so potentially forever. Any 
settings you have for canceling the query after some timeout have 
nothing to do with the pool itself; that's between the driver, the 
server, etc.
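
Put differently: the only thing that can actually cancel a running query is
the driver. Your URL-level queryTimeout should amount to the same thing as
the standard per-statement JDBC call, sketched here for illustration:

    import java.sql.Connection;
    import java.sql.PreparedStatement;

    // The pool never cancels queries; the driver (and server) do.
    void runWithTimeout(Connection conn, String sql) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setQueryTimeout(10); // seconds; enforced by driver/server
            ps.execute();
        }
    }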

> What I believe I saw in some cases was that the socket timeout that I set on the SQLServer JDBC driver to 300,000ms (5 mins) kicked in and a read timeout was generated in the driver. This led to a bad connection being returned to the connection pool and caused errors until the connection(s) aged out due to maxAge.

It sounds to me like you need to reduce the socket read timeout for the 
connections in the driver. Your maxAge applies at checkout-time. The 
"bad" connection will indeed be back in the pool, but it should be 
removed on the next checkout. If "removeAbandoned" is working as 
expected, a 5-minute query-time should have already resulted in the 
connection being removed from the pool (due to abandonment).

> Does anyone know why the Pool Cleaner would be able to run and NOT forcefully remove a connection from the Tomcat JDBC pool that has been running for more than the configured removeAbandonedTimeout?

Have you confirmed that there is a pool-cleaner thread running?

> I searched Google and this list on MARC for any clues and I found nothing. I also reviewed the source code for the Tomcat JDBC module and was unable to see a place in the code that indicated it would not be able to clobber a running connection that was borrowed from the pool.
> 
> Anyone able to help me understand what I saw in my testing? Could this be a Tomcat connection pool bug?

It's possible. Have you enabled debug logging on the pool and checked 
your test again? I would highly recommend enabling "logAbandoned" if you 
are trying to debug problems with abandoned connection cleanup.
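
For example, in conf/logging.properties (a sketch assuming the stock JULI
setup, where the ConsoleHandler already allows FINE):

    # Enable debug logging for the tomcat-jdbc pool
    org.apache.tomcat.jdbc.pool.level = FINE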

-chris



RE: Tomcat 9.0.x JDBC connection pool does not always remove abandoned connections

Posted by "Martin, Gerhardt A" <Ge...@snapon.com>.
Chris, 

really appreciate you taking some time to respond. See my replies inline below.

> -----Original Message-----
> From: Christopher Schultz <ch...@christopherschultz.net>
> Sent: Thursday, October 14, 2021 12:19 PM
> To: users@tomcat.apache.org
> Subject: Re: Tomcat 9.0.x JDBC connection pool does not always remove
> abandoned connections
> 
> Gerhardt,
> 
> On 10/12/21 13:27, Martin, Gerhardt A wrote:
> > Running Tomcat 9.0.50 on CentOS 7.9.x Linux and using the Tomcat JDBC
> connection pool to connect to my application's databases. My app connects to
> about a dozen read-only databases and one read/write database. Here is a
> typical resource definition with tuning configurations for the pool and the
> MSSQL Server JDBC driver:
> >
> > <Resource
> 
> Note that these are the only relevant attributes:
> 
>  > factory="org.apache.tomcat.jdbc.pool.DataSourceFactory"
> 
> You are using tomcat-pool, not the default commons-dbcp.
 
100% aware of this and we plan to continue to use it.
> 
> > maxActive="20"
> > initialSize="1"
> > maxIdle="8"
> > minIdle="1"
> > maxWait="5000"
> > testOnBorrow="false"
> > testWhileIdle="false"
> > testOnConnect="false"
> > testOnReturn="false"
> > timeBetweenEvictionRunsMillis="10000"
> > removeAbandonedTimeout="15"
> > removeAbandoned="true"
> > logAbandoned="false"
> > minEvictableIdleTimeMillis="900000"
> > maxAge="600000"
> 
> 
> 
> > Important NOTE: the framework we employ does NOT execute more than one
> > SQL statement per borrow from the connection pool. It's a constant
> > cycle of getConnection(borrow) ->execute 1
> > SQL->closeConnection(return)
> 
> That's kind of wasteful, but okay.
I agree to some extent, but we have had stable success with this inherited approach. It's tough to change in our current situation, but I have given it some thought.
> 
> > Why are we configured this way:
> > Ultimately we need to fail fast to avoid our APP going into death by stuck
> threads when we lose connectivity to the backend databases for a short period
> of time. ( for example a short 10 second period where the DB(s) is/are
> unreachable from the app servers).
> 
> Is this common?
 
Not common, but when it occurs it is very disruptive to our customer base, hence the need to improve resiliency.  
> 
> > -We want the evictor thread (Pool Cleaner) to run every 10 seconds -We
> > are only using the Pool Cleaner to clobber busy connections that have been
> running 15 seconds or more
> >    NOTE: we rarely see SQL that takes longer than 2 seconds, most are
> > sub 100ms and the avg is about 4ms
> 
> Note that the connections aren't clobbered; they are simply dropped from the
> pool. The work they generated is still continuing...
> 
> > -We are not testing connections period -We want our connections
> > replaced with new fresh connections after 10 mins of use hence we set
> > maxAge=600000 The connection pool settings work hand in hand with
> properties set on the SQL server JDBC driver in the URL:
> > The driver will kill any SQL that takes longer than 10 seconds
> 
> Ouch.
> 
> > -When everything else is OK, then the connection itself is returned
> > intact back to the connection pool -When the DB is unreachable, there
> > is another timer cancelQueryTimeout set to 10 seconds which is a safety valve
> for cases where the message to the DB server to cancel the SQL gracefully does
> not make it to the DB. After the cancelQueryTimeout, the client side of the
> driver gives up. This is problematic because it actually returns a bad connection
> back to the connection pool.
> >
> > The combination of the connection pool settings to run the Pool cleaner every
> 10 seconds and removeAbandonedTimeout of 15 seconds should work together to
> remove any connection whose SQL execution time went past 10 seconds before
> the cancelQueryTimeout kicks in. This should prevent returning bad connections
> to the pool when bad things happen like a network switch going offline
> temporarily and dropping all packets between the app server hosting tomcat and
> the database server.
> 
> Sounds like you really just want to be validating connections, but you have
> explicitly disabled that feature. Why?
 
So we have a fairly high volume of SQL in our APPs, and given our current borrow -> execute SQL -> return pattern, test-on-borrow or test-on-return doubles the number of SQL statements we run, even if one of them is "SELECT 1". Our DBAs are not keen on this. Also, there is no guarantee that the connection does not break after the SELECT 1 succeeds. I have to say I am not so keen myself. I know this can be softened with the validationInterval option in the Tomcat JDBC pool, but at ~90 SQL statements per second, a 3-second wait between validations can easily rack up nearly ~270 app-level errors before the connection is checked again. The tradeoffs are just pretty ugly here.

> 
> > NOTE that for the MSSQL server driver all their timeouts default to wait
> indefinitely. This is one of the prime reasons why even short outages can kill our
> APP with stuck threads. Our Web app hits tomcat about 2000 requests per
> minute and each request is going to fire on AVG ~6 SQL statements so we are
> running about 12K SQL hits per minute. If we lose connectivity to a database for
> even 15 seconds without a means to fail fast we could easily overcome the
> allotted number of threads which will create a "log jam" that can take a very
> long time to recover.
> 
> Why doesn't your query-timeout handle this situation?
The query timeout does help as long as the database is still reachable. The problem is that in cases where our path to the database is cut off (rare as that is), unless we resort to test-on-borrow and/or test-on-return, we run the risk of returning a dead connection to the pool, which then gets borrowed many thousands of times, causing many customer requests to error out.

> 
> > So I am very sorry for all this lengthy background but I thought it might help for
> readers and potential responders to know a bit about why we do what we do in
> our connection pool.
> >
> > Finally now onto the problem:
> >
> > I was running a load test and simulating the temporary loss of
> > connectivity to one or more of the databases used in the APP and I
> > noticed a strange behavior that does not align with my understanding
> > of how Tomcat JDBC removeAbandoned works. While most of my database
> > connections coming from the pool did die shortly after the simulation
> > started and "failed fast" there were a few connections that actually
> > ran for up to 5 minutes before they terminated themselves. Given the
> > configuration I explained I would have expected no database connection
> > that was interacting with the database to live beyond 30 seconds. (
> > given that there could be some forward/reverse lag in the evictor
> > thread timing where there could be a slight chance that the first run
> > of the evictor thread might not catch every connection that was at the
> > 15 seconds mark.)
> 
> Remember that the pool won't kill the connection or anything else. It just lets
> the connection "leak" out of the pool and ... hopefully eventually be garbage-
> collected. But if the thread running that query is still running, it will continue to
> do so potentially forever. Any settings you have for canceling the query after
> some timeout have nothing to do with the pool itself; that's between the driver,
> the server, etc.
> 
> > What I believe I saw in some cases was that the socket timeout that I set on
> the SQLServer JDBC driver to 300,000ms (5 mins) kicked in and a read timeout
> was generated in the driver. This led to a bad connection being returned to the
> connection pool and caused errors until the connection(s) aged out due to
> maxAge.
> 
> It sounds to me like you need to reduce the socket read timeout for the
> connections in the driver. Your maxAge applies at checkout-time. The "bad"
> connection will indeed be back in the pool, but it should be removed on the next
> checkout. If "removeAbandoned" is working as expected, a 5-minute query-time
> should have already resulted in the connection being removed from the pool
> (due to abandonment).
Yes, I agree that this is a good step. I think if I set the socket read timeout to a more reasonable value (15-20 seconds), we should be able to prevent Tomcat threads from running longer than 30 or so seconds. It's something I plan to test out.
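
For example, something like this on the connection URL (host and database
are placeholders; socketTimeout is in milliseconds per the driver docs):

    jdbc:sqlserver://dbhost:1433;databaseName=appdb;queryTimeout=10;cancelQueryTimeout=10;socketTimeout=20000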

 

> 
> > Does anyone know why the Pool Cleaner would be able to run and NOT
> forcefully remove a connection from the tomcat JDBC pool that has been
> running for more than the configured removeAbandonedTimeout?
> 
> Have you confirmed that there is a pool-cleaner thread running?
> 
> > I searched Google and this list on MARC for any clues and I found nothing. I
> also reviewed the source code for the Tomcat JDBC module and was unable to
> see a place in the code that indicated it would not be able to clobber a running
> connection that was borrowed from the pool.
> >
> > Anyone able to help me understand what I saw in my testing? Could this be a
> Tomcat connection pool bug?
> 
> It's possible. Have you enabled debug logging on the pool and checked your test
> again? I would highly recommend enabling "logAbandoned" if you are trying to
> debug problems with abandoned connection cleanup.
 
I will give that a shot and I have a few more ideas on what to test out. 
 
> -chris

-Regards-

Gerhardt Martin