Posted to users@tomcat.apache.org by Greg Ward <gw...@python.net> on 2006/09/25 18:01:43 UTC

Monitoring Tomcat load (4.1.30)

Hi -- we've recently had problems with Tomcat on one server getting
overloaded and then refusing further requests.  What happens is that
more and more request-processing threads get stuck on some slow
operation (e.g. a database query, or opening a socket to another server),
until eventually Tomcat's thread pool is exhausted and it starts
responding with "503 Service temporarily unavailable".

Now, obviously the *real* fix is to get to the root of things and figure
out why those requests are slow (or blocked, or whatever).  But this
keeps happening, for a different reason each time, and when it does we get
angry customers on the line wanting to know what the hell happened to
the web server.  (We deploy a suite of related web applications on
dedicated servers to a hundred or so customers; one day a deadlock will
affect customer X, a few weeks later a network outage between servers
will affect customer Y, and so on.)

So I want to implement some sort of automatic load monitoring of Tomcat.
This should give us a rough idea of when things are about to go bad
*before* the customer even finds out about it (never mind picks up the
phone) -- and it's independent of what the underlying cause is.

Ideally, I'd like to know if the number of concurrent requests goes
above X for Y minutes, and raise the alarm if so.  This is across *all*
webapps running in the same container.
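
In pseudo-code, the watchdog I'm picturing is roughly this (just a
sketch: get_busy_count() and raise_alarm() are placeholders for whatever
does the measuring and the notifying, and the numbers are invented):

  # Sample the busy-thread count once a minute and alarm if it stays
  # above the threshold for a sustained stretch.
  import time

  THRESHOLD = 25        # "X": concurrent requests considered too many
  WINDOW    = 10 * 60   # "Y": seconds the count must stay high before alarming
  INTERVAL  = 60        # seconds between samples

  def watch(get_busy_count, raise_alarm):
      over_since = None
      while True:
          busy = get_busy_count()
          if busy > THRESHOLD:
              if over_since is None:
                  over_since = time.time()
              elif time.time() - over_since >= WINDOW:
                  raise_alarm("%d concurrent requests for %d+ minutes"
                              % (busy, WINDOW // 60))
                  over_since = None   # reset so we don't alarm every minute
          else:
              over_since = None
          time.sleep(INTERVAL)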

I've implemented a vile hack that hits Tomcat's process with SIGQUIT to
trigger a thread dump, then parses the thread dump to look for threads
that the JVM says are "runnable".  E.g. this:

  "Ajp13Processor[8009][33]" daemon prio=1 tid=0x0856a528 nid=0x2263 in Object.wait() [99bc4000..99bc487c]

is presumed to be an idle thread (it's waiting on an
org.apache.ajp.tomcat4.Ajp13Processor monitor).  But this:

  "Ajp13Processor[8009][28]" daemon prio=1 tid=0x0856c6d8 nid=0x2263 runnable [99dc7000..99dc887c]

is presumed to be processing a request (it's deep in the bowels of our
JDBC driver, reading from the database server ... which is what most of
our requests seem to spend most of their time doing).
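
Stripped down, the hack looks roughly like this (a simplified sketch
only: it assumes the JVM's stdout, and hence the thread dump, ends up in
catalina.out, and the path and the two-second wait are placeholders to
be tuned):

  # Ask the JVM for a thread dump with SIGQUIT, wait for it to land in
  # catalina.out, then count Ajp13Processor threads reported as "runnable".
  import os, signal, time

  def count_busy(pid, catalina_out="/var/log/tomcat/catalina.out"):
      mark = os.path.getsize(catalina_out)   # where the log ends right now
      os.kill(pid, signal.SIGQUIT)           # trigger the thread dump
      time.sleep(2)                          # give the JVM a moment to write it

      f = open(catalina_out)
      f.seek(mark)
      dump = f.read()
      f.close()

      busy = total = 0
      for line in dump.splitlines():
          if line.startswith('"Ajp13Processor'):
              total += 1
              if "runnable" in line:
                  busy += 1
      return busy, total

The busy/total pair is what gets written to the per-snapshot log lines
shown below.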

This seems to work and it gives a rough-and-ready snapshot of how busy
Tomcat is at the moment.  If I run it every 60 sec for a while, I get
output like this:

  /var/log/tomcat/thread-dump-20060925_112753.log: 20/34
  /var/log/tomcat/thread-dump-20060925_112858.log: 17/34
  /var/log/tomcat/thread-dump-20060925_113003.log: 20/34
  /var/log/tomcat/thread-dump-20060925_113109.log: 20/34
  /var/log/tomcat/thread-dump-20060925_113214.log: 18/34
  /var/log/tomcat/thread-dump-20060925_113319.log: 21/34

where the first number is the count of "runnable" Ajp13Processor threads
(i.e. concurrent requests) and the second number is the total number of
Ajp13Processors.

I have two concerns about this vile hack:

  * well, it *is* a vile hack -- is there a cleaner way to get this
    information out of Tomcat? (keeping in mind that we're running
    4.1.30)

  * just how hard on the JVM is it to get a thread dump?  I would
    probably decrease the frequency to every 10 min in production,
    but even so it makes me a bit nervous.

Thanks --

        Greg
