Posted to modperl@perl.apache.org by Sean Chittenden <se...@serverninjas.com> on 2000/02/24 00:03:00 UTC

Database connection pooling... (beyond Apache::DBI)

	Howdy.  We're all probably pretty familiar with Apache::DBI and
the fact that it opens a database connection per apache process.  Sounds
groovy and works well with only one or two servers.  Everything is gravy
until you get a cluster of servers, ie 20-30 machines, each with 300+
processes.  I've looked into DBD::Proxy and DBI::ProxyServer, which seem
really cool, and I'm wondering if anyone has any experience with those two
modules.  Yes? No?
	Are there any great solutions to this problem aside from tossing a
TON of RAM into the database server and turning up the number of database
connections?  I have a few ideas of my own, but I don't want to reinvent
the wheel and want to hear what's out there.  Thanks.  --SC
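[For readers who haven't seen the proxy modules: the setup looks roughly
like the sketch below.  The host name, port, credentials, and inner DSN are
placeholders, and running dbiproxy requires DBI::ProxyServer installed.]

```perl
# Somewhere on the network, run the proxy server that owns the real
# database connections (shell command shown as a comment):
#
#   dbiproxy --localport 3334 --mode fork
#
# The mod_perl side then connects through the proxy; the real DSN is
# tunneled inside the dbi:Proxy DSN string.
use strict;
use warnings;
use DBI;

my ($user, $pass) = ('scott', 'tiger');   # placeholders
my $dbh = DBI->connect(
    'dbi:Proxy:hostname=dbproxy.example.com;port=3334;dsn=dbi:Oracle:prod',
    $user, $pass, { RaiseError => 1 },
);
```

Because DBD::Proxy speaks ordinary DBI on the client side, handlers written
against a direct connection shouldn't need changes beyond the DSN.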


	PS  I'd like to post the results of this discussion and compile it
into something that would be suitable for the FAQ.

-- 
Sean Chittenden
sean.chittenden@usa.net



Re: Database connection pooling... (beyond Apache::DBI)

Posted by Bill <mc...@isis2000.com>.
"Jeffrey W. Baker" wrote:
> 
> This conversation happens on an approximately biweekly schedule, either
> on modperl or dbi-users, or some other list I have the misfortune of
> frequenting.  Please allow me to expand upon this subject a bit.
> 
> I have not yet gotten a satisfactory answer from anyone who starts these
> threads regarding why they want connection pooling.  I suspect that
> people think it is needed because everyone else (Netscape, Microsoft,
> Bea) is doing it.  There is a particular kind of application where
> pooled connections are useful, and there are particular situations where
> it is a waste.  Every project I have ever done falls into the latter
> category, and I can only think of a few cases that fall under the
> former.

   Interesting point of view. I'd never stopped to think about how necessary
it REALLY was...

> Connection pooling is a system where your application server threads or
> processes, which number n on a single machine, share a pool of database
> connections which number fewer than n.  This is done to minimize the
> number of database connections which are open at once, which in turn is
> supposed to reduce the load on the database server.  This is effective
> when database activity is a small fraction of the total load on the
> application server.  For example, if your application server mostly
> performs matrix manipulation, and only occasionally hits the database,
> it would make sense for it to relinquish the connection when it is not
> in use.
> 
> The downside to connection pooling is that it imposes some overhead.
> The connections must be managed properly, and the scheme should be
> transparent to the programmer whose code is using it.  So when a piece
> of code requests a database connection, the connection manager needs to
> decide which one to return.  It may have to wait for one to free up, or
> it may have to open one based on some low-water-mark heuristic.  It may
> also need to decide that a connection consumer has died or gone away,
> possibly taking the connection with it.  So you can see that opening a
> pooled connection is more computationally expensive than opening a
> dedicated connection.

   Ah, but just opening a dedicated connection is so slow as to be avoided
as much as possible. Adding a little overhead here wouldn't seem to hurt,
since the overhead to begin with already makes you want to reopen your
db handles as infrequently as you can.

> This pooling overhead is a total waste of time when the majority of what
> your application is doing is database-related.  If your program will
> issue 100 queries and perform a transaction during the course of
> fulfilling a request, pooled connections will not make sense.  The
> reason is that Apache already provides a mechanism for killing off
> database connections in this scenario.  If a process or thread is
> sitting about idle, Apache will come along and terminate it, freeing the
> database connection in the process.  For database-bound or transactional
> programs, the one-to-one mapping of processes to database connections is
> ideal.

   There is certainly something that's cleaner about this approach. I hadn't
really thought of this either.

> Pooling is also less attractive because modern databases can handle many
> connections.  Oracle with MTS will run fine with just as many

   Ok, stupid question: What is MTS? Multithreaded Server? Part of the
Enterprise package or Standard?

> connections as you care to open.  The application designer should study
> how many connections he realistically plans to open.  If your
> application is bound by database performance, it makes sense to cap the
> number of clients, so you would not allow your application to open too
> many connections.  If your application is transactional, you don't have
> any choice but to give each process its own dedicated connection.  If
> your application is compute-bound, then your database is lightly loaded
> and you won't mind opening a lot of connections.
> 
> The summary is that if your application is database-bound, or is
> processing transactions, you don't need or even want connection pooling.
> 
> -jwb

   Hmm...are you sure the point about transactions is true? Couldn't you do
database connection pooling and have transactions going, as long as you
ensured any previous transactions had been rolled back? I can see where it
could potentially get ugly to determine the exact state of a db handle, but
I know that uglier problems have been solved. :)

   I think my desire for pooled database connections comes down to wanting
flexibility. It's great that we have a decent system for doing database
access from mod_perl, but if there were a clean way to get pooled database
connections, I wouldn't have to worry about the ratio of db-related handlers
vs. non-db ones. I could just write code that runs fast, maybe does some db
stuff or not, and then tune the number of db connections if that became a
point of contention. There are also licensing concerns for some people that
would make pushing down the number of db connections advantageous. Ideally
you could choose pooled db connections or not, depending on what you decided
was a good fit.

   One last case where a smaller number of db connections would make sense
is when the db accesses made by handlers are pretty short. If you hit the db
for an insert or something, the time spent in the db is small compared to
the rest of the time needed to process the request (say for an ad server),
AND you need to handle a large number of requests per second, then it would
be nice to share just a few database handles. Realistically I don't know
that the overhead imposed by pooled db connections wouldn't get in the way,
but hey, it's a theoretical example. :) Anyway, perhaps this topic gets
beaten to death, but I still find it interesting to discuss...

- Bill

Re: Database connection pooling... (beyond Apache::DBI)

Posted by "Jeffrey W. Baker" <jw...@acm.org>.
Greg Stark wrote:
> 
> Sean Chittenden <se...@serverninjas.com> writes:
> 
> >       Howdy.  We're all probably pretty familiar with Apache::DBI and
> > the fact that it opens a database connection per apache process.  Sounds
> > groovy and works well with only one or two servers.  Everything is gravy
> > until you get a cluster of servers, ie 20-30 machines, each with 300+
> > processes.
> 
> 300+ perl processes per machine? No way. The only way that would make _any_
> sense is if your perl code is extremely i/o dependent and your perl code is
> extremely light. Even then you're way better off having the i/o operations
> queued quickly and processed asynchronously.

This conversation happens on an approximately biweekly schedule, either
on modperl or dbi-users, or some other list I have the misfortune of
frequenting.  Please allow me to expand upon this subject a bit.

I have not yet gotten a satisfactory answer from anyone who starts these
threads regarding why they want connection pooling.  I suspect that
people think it is needed because everyone else (Netscape, Microsoft,
Bea) is doing it.  There is a particular kind of application where
pooled connections are useful, and there are particular situations where
it is a waste.  Every project I have ever done falls into the latter
category, and I can only think of a few cases that fall under the
former.

Connection pooling is a system where your application server threads or
processes, which number n on a single machine, share a pool of database
connections which number fewer than n.  This is done to minimize the
number of database connections which are open at once, which in turn is
supposed to reduce the load on the database server.  This is effective
when database activity is a small fraction of the total load on the
application server.  For example, if your application server mostly
performs matrix manipulation, and only occasionally hits the database,
it would make sense for it to relinquish the connection when it is not
in use.

The downside to connection pooling is that it imposes some overhead. 
The connections must be managed properly, and the scheme should be
transparent to the programmer whose code is using it.  So when a piece
of code requests a database connection, the connection manager needs to
decide which one to return.  It may have to wait for one to free up, or
it may have to open one based on some low-water-mark heuristic.  It may
also need to decide that a connection consumer has died or gone away,
possibly taking the connection with it.  So you can see that opening a
pooled connection is more computationally expensive than opening a
dedicated connection.
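[To make the manager's bookkeeping concrete, here is a deliberately naive
single-process sketch of the checkout side.  The names are invented; a real
pool also has to handle waiting, health checks, dead consumers, and
transaction state, which is exactly the overhead being described.]

```perl
package ToyPool;
use strict;
use warnings;
use DBI;

# Keep at most {max} handles open; hand them out round-robin.
sub new {
    my ($class, %arg) = @_;   # dsn, user, pass, max
    return bless { %arg, handles => [], next => 0 }, $class;
}

sub checkout {
    my ($self) = @_;
    # Open a new connection only while under the cap.
    if (@{ $self->{handles} } < $self->{max}) {
        push @{ $self->{handles} },
            DBI->connect($self->{dsn}, $self->{user}, $self->{pass},
                         { RaiseError => 1 });
    }
    # Round-robin over whatever is open; no waiting, no liveness checks.
    return $self->{handles}[ $self->{next}++ % @{ $self->{handles} } ];
}

1;
```

Even this toy version shows where the cost goes: every checkout involves a
policy decision that a dedicated per-process handle never has to make.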

This pooling overhead is a total waste of time when the majority of what
your application is doing is database-related.  If your program will
issue 100 queries and perform a transaction during the course of
fulfilling a request, pooled connections will not make sense.  The
reason is that Apache already provides a mechanism for killing off
database connections in this scenario.  If a process or thread is
sitting about idle, Apache will come along and terminate it, freeing the
database connection in the process.  For database-bound or transactional
programs, the one-to-one mapping of processes to database connections is
ideal.

Pooling is also less attractive because modern databases can handle many
connections.  Oracle with MTS will run fine with just as many
connections as you care to open.  The application designer should study
how many connections he realistically plans to open.  If your
application is bound by database performance, it makes sense to cap the
number of clients, so you would not allow your application to open too
many connections.  If your application is transactional, you don't have
any choice but to give each process its own dedicated connection.  If
your application is compute-bound, then your database is lightly loaded
and you won't mind opening a lot of connections.

The summary is that if your application is database-bound, or is
processing transactions, you don't need or even want connection pooling.

-jwb

Re: Database connection pooling... (beyond Apache::DBI)

Posted by Greg Stark <gs...@mit.edu>.
Sean Chittenden <se...@serverninjas.com> writes:

> 	Howdy.  We're all probably pretty familiar with Apache::DBI and
> the fact that it opens a database connection per apache process.  Sounds
> groovy and works well with only one or two servers.  Everything is gravy
> until you get a cluster of servers, ie 20-30 machines, each with 300+
> processes.  

300+ perl processes per machine? No way. The only way that would make _any_
sense is if your perl code is extremely i/o dependent and your perl code is
extremely light. Even then you're way better off having the i/o operations
queued quickly and processed asynchronously.

In all likelihood what you really want is more like 16-32 processes per
machine. That should be more than enough to peg the cpu at 100% even if your
pages are all database driven. Be sure to have separate servers for non-perl
code, and a cluster of 20 * 32 processes should be able to handle a few
million hits per day even for loosely coded perl (assuming the database can
keep up, of course).
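[To put a rough number on that claim (my figures, not Greg's): even at a
pessimistic one second of wall time per request, 640 busy processes clear
"a few million hits per day" with a wide margin.]

```perl
use strict;
use warnings;

# Back-of-envelope capacity estimate; the per-request time is an
# assumption, and it ignores idle time and uneven load.
my $machines         = 20;
my $per_machine      = 32;
my $processes        = $machines * $per_machine;   # 640 workers
my $secs_per_request = 1;                          # pessimistic
my $per_day = $processes / $secs_per_request * 86_400;
printf "%d processes -> %d requests/day\n", $processes, $per_day;
# prints: 640 processes -> 55296000 requests/day (if every worker stays busy)
```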

32 * 30 is 960 connections. If you're building something of this order of
magnitude there are plenty of databases around that scale to that many
connections.
Oracle with MTS on suitable hardware shouldn't really blink at it, for
example. 

> 	Are there any great solutions to this problem aside from tossing a
> TON of RAM into the database server and turning up the number of database
> connections?  I have a few ideas of my own, but I don't want to reinvent
> the wheel and want to hear what's out there.  Thanks.  --SC

The DBI::Proxy scheme is worth experimenting with. If you try it under load I
would love to see the results. It has the advantage that additional features
could be hacked into DBI::Proxy that aren't supported by the underlying
database, such as asynchronous connections.

The alternatives are: 

1) a different architecture that allows for non 1-1 mapping between script
engine and web server processes. Ie, FastCGI. This doesn't completely solve
the problem since you may be doing substantial non-database processing in
these scripts as well.

2) database specific features such as Oracle MTS or Oracle Connection Manager.

> 	PS  I'd like to post the results of this discussion and compile it
> into something that would be suitable for the FAQ.

There have been a few previous similar discussions, you could start by reading
the archives.


-- 
greg


Re: Database connection pooling... (beyond Apache::DBI)

Posted by Bill McKinnon <mc...@isis2000.com>.
   Depending on your app, you may want to at least consider FastCGI. For
handling a single task that doesn't send much data back to the client but
needs to get hit MANY times a second, it is nearly ideal, at least
architecture-wise. If you have 100 different scripts that get called with
equal frequency and that all do DB stuff, it's not a good fit.
A mix of mod_perl and FastCGI can solve most problems--they tend to
complement each other, IMHO. If you're not familiar with fastcgi, it's a
protocol that lets an app start up once and then essentially handle
requests in a while() loop, allowing you to use just a single db
connection pretty effectively if you're doing lots of little (i.e. short)
queries or updates. There's an apache module called mod_fastcgi that's
available at www.fastcgi.com. Ok, I'll stop blaspheming. :)
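[The while() loop Bill describes looks roughly like this in Perl, using the
CGI::Fast module that ships with the fastcgi support; the DSN, credentials,
and query are placeholders.]

```perl
#!/usr/bin/perl
use strict;
use warnings;
use CGI::Fast;
use DBI;

# Connect once at startup; the same handle then serves every request
# this FastCGI process accepts, which is the whole point of the model.
my ($user, $pass) = ('scott', 'tiger');   # placeholders
my $dbh = DBI->connect('dbi:Oracle:prod', $user, $pass,
                       { RaiseError => 1 });

# Each pass through the loop is one request.
while (my $q = CGI::Fast->new) {
    my ($count) = $dbh->selectrow_array('SELECT COUNT(*) FROM hits');
    print $q->header('text/plain'), "hits so far: $count\n";
}
```

One process, one connection, many requests -- which is how a handful of db
handles can cover a large request rate when each query is short.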

- Bill

On Wed, 23 Feb 2000, Sean Chittenden wrote:

> 	Howdy.  We're all probably pretty familiar with Apache::DBI and
> the fact that it opens a database connection per apache process.  Sounds
> groovy and works well with only one or two servers.  Everything is gravy
> until you get a cluster of servers, ie 20-30 machines, each with 300+
> processes.  I've looked into DBD::Proxy and DBI::ProxyServer, which seem
> really cool, and I'm wondering if anyone has any experience with those two
> modules.  Yes? No?
> 	Are there any great solutions to this problem aside from tossing a
> TON of RAM into the database server and turning up the number of database
> connections?  I have a few ideas of my own, but I don't want to reinvent
> the wheel and want to hear what's out there.  Thanks.  --SC
> 
> 
> 	PS  I'd like to post the results of this discussion and compile it
> into something that would be suitable for the FAQ.
> 
> -- 
> Sean Chittenden
> sean.chittenden@usa.net
> 
>