Posted to users@tomcat.apache.org by Jesse Defer <Je...@asu.edu> on 2015/02/23 19:03:19 UTC

mod_jk causing stuck Apache processes

I have a farm of Apache httpd servers proxying to Tomcat with mod_jk and I am having issues with Apache processes getting stuck (as seen by the W state in server-status).  I am sending to this list because the stack traces show httpd gets stuck in mod_jk.

httpd is configured for prefork with 512 servers at start and at maximum.  When the problem occurs we end up with nearly all 512 processes in the W state until we restart it.  The problem occurs more often when load is high but is not restricted to high load.  The problem started occurring more often when we increased the servers from 384 to 512.  The hosts have enough memory and do not swap.  The issue occurs intermittently and is not tied to a particular host or Tomcat instance (there are ~150 Tomcat instances).  The JkShmFile is on tmpfs.
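For reference, the MPM settings in question look roughly like this (an illustrative sketch based on the description above, not our exact config):

```apache
# httpd.conf (prefork MPM) - illustrative values only
ServerLimit            512
StartServers           512
MinSpareServers        512
MaxSpareServers        512
MaxRequestWorkers      512

# mod_jk, with the shared memory file on tmpfs
JkWorkersFile conf/workers.properties
JkShmFile     /dev/shm/jk-runtime-status
```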

Environment: RHEL5.11, Apache 2.4.10 (prefork), JK 1.2.40, APR 1.5.1, APR-UTIL 1.5.4

The stuck httpd processes all show the same stack and strace:

pstack:
#0  0x00002b3439026bff in fcntl () from /lib64/libpthread.so.0
#1  0x00002b3440911656 in jk_shm_lock () from /usr/local/apache2/modules/mod_jk.so
#2  0x00002b3440917805 in ajp_maintain () from /usr/local/apache2/modules/mod_jk.so
#3  0x00002b3440906cac in maintain_workers () from /usr/local/apache2/modules/mod_jk.so
#4  0x00002b3440901140 in wc_maintain () from /usr/local/apache2/modules/mod_jk.so
#5  0x00002b34408f40b6 in jk_handler () from /usr/local/apache2/modules/mod_jk.so
#6  0x0000000000448eca in ap_run_handler ()
#7  0x000000000044cc92 in ap_invoke_handler ()
#8  0x000000000045e24f in ap_process_async_request ()
#9  0x000000000045e38f in ap_process_request ()
#10 0x000000000045ab65 in ap_process_http_connection ()
#11 0x00000000004530ba in ap_run_process_connection ()
#12 0x000000000046423a in child_main ()
#13 0x0000000000464544 in make_child ()
#14 0x00000000004649ae in prefork_run ()
#15 0x0000000000430634 in ap_run_mpm ()
#16 0x000000000042ad97 in main ()

strace:
fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0
fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
time(NULL)                              = 1424711498
fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0
fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0
fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0
fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
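For anyone reading along: the pattern above is POSIX record locking on the shared memory file (jk_shm_lock() wraps fcntl with F_SETLKW). A minimal Python sketch of the same lock/unlock cycle on a stand-in file (an illustration only, not mod_jk's actual code):

```python
import fcntl
import os
import tempfile

def lock_cycle(path: str) -> bool:
    """Acquire and release an exclusive record lock on byte 0 of `path`,
    mirroring the fcntl(F_SETLKW) lock/unlock pattern in the strace above."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.write(fd, b"\0")
        fcntl.lockf(fd, fcntl.LOCK_EX, 1, 0)  # blocking write lock (F_WRLCK)
        # ... a real caller would read/update shared worker state here ...
        fcntl.lockf(fd, fcntl.LOCK_UN, 1, 0)  # unlock (F_UNLCK)
        return True
    finally:
        os.close(fd)

shm = os.path.join(tempfile.mkdtemp(), "jk-shm")
ok = lock_cycle(shm)
```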

Any help tracking this issue down would be appreciated.

Thanks,
Jesse DeFer

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: mod_jk causing stuck Apache processes

Posted by Rainer Jung <ra...@kippdata.de>.
On 23.02.2015 at 20:34, Rainer Jung wrote:
> On 23.02.2015 at 19:03, Jesse Defer wrote:
>> I have a farm of Apache httpd servers proxying to Tomcat with mod_jk
>> and I am having issues with Apache processes getting stuck (as seen by
>> the W state in server-status).  I am sending to this list because the
>> stack traces show httpd gets stuck in mod_jk.
>>
>> httpd is configured for prefork and with 512 servers on start and
>> maximum.  When the problem occurs we end up with nearly all 512
>> processes in the W state until we restart it.  The problem occurs more
>> often when load is high but is not restricted to high load.  The
>> problem started occurring more often when we increased the servers from
>> 384 to 512.  The hosts have enough memory and do not swap.  The issue
>> occurs intermittently and is not tied to a particular host or Tomcat
>> instance (there are ~150 Tomcat instances).  The JkShmFile is on tmpfs.
>
> Why on tmpfs?
>
>> Environment: RHEL5.11, Apache 2.4.10 (prefork), JK 1.2.40, APR 1.5.1,
>> APR-UTIL 1.5.4
>>
>> The stuck httpd processes all show the same stack and strace:
>>
>> pstack:
>> #0  0x00002b3439026bff in fcntl () from /lib64/libpthread.so.0
>> #1  0x00002b3440911656 in jk_shm_lock () from
>> /usr/local/apache2/modules/mod_jk.so
>> #2  0x00002b3440917805 in ajp_maintain () from
>> /usr/local/apache2/modules/mod_jk.so
>> #3  0x00002b3440906cac in maintain_workers () from
>> /usr/local/apache2/modules/mod_jk.so
>> #4  0x00002b3440901140 in wc_maintain () from
>> /usr/local/apache2/modules/mod_jk.so
>> #5  0x00002b34408f40b6 in jk_handler () from
>> /usr/local/apache2/modules/mod_jk.so
>> #6  0x0000000000448eca in ap_run_handler ()
>> #7  0x000000000044cc92 in ap_invoke_handler ()
>> #8  0x000000000045e24f in ap_process_async_request ()
>> #9  0x000000000045e38f in ap_process_request ()
>> #10 0x000000000045ab65 in ap_process_http_connection ()
>> #11 0x00000000004530ba in ap_run_process_connection ()
>> #12 0x000000000046423a in child_main ()
>> #13 0x0000000000464544 in make_child ()
>> #14 0x00000000004649ae in prefork_run ()
>> #15 0x0000000000430634 in ap_run_mpm ()
>> #16 0x000000000042ad97 in main ()
>
> So mod_jk tries to get a lock on the shared memory before reading and
> updating some shared memory data. That, as one of many things mod_jk does,
> is normal. It would not be normal if most processes seem to almost
> always sit in this stack.
>
>> strace:
>> fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0
>> fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
>> time(NULL)                              = 1424711498
>> fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0
>> fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
>> fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0
>> fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
>> fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0
>> fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
>>
>> Any help tracking this issue down would be appreciated.
>
> This doesn't look like mod_jk is hanging in jk_shm_lock(); instead it
> looks like normal processing, locking and then unlocking.
> Locking and unlocking succeed (return code 0).
>
> You didn't provide time stamps to get an idea, whether it is normal
> behavior or not. What is your request throughput (requests per second
> forwarded by mod_jk as long as it is running well)?
>
> I suspect something else is wrong. Have you checked whether in fact
> mod_jk waits for the responses to requests it has sent to the backend
> and which do not return quickly, e.g. by looking at a Java thread dump
> (not: heap dump) of the backend during the times you have problems?
>
> What are the error or warn messages in your mod_jk log file?
>
> Did you use "apachectl graceful" shortly before the problems start, or
> change configuration via the mod_jk status worker?
>
> What is special here is the use of many processes plus many workers. So
> the lock is used quite a lot. Still, the "global maintain" functionality
> which uses the jk_shm_lock() in the stack above should be called by each
> process only every worker.maintain seconds, by default every 60 seconds.
> And each process should quickly detect whether another process already
> did the global maintain. During the global maintain, for any ajp13
> worker there are really just a few simple local code statements. For any
> lb (load balancer) worker, there is a little more to do, especially
> checking whether any members have failed long ago and should now be
> recovered. But still those operations are all local, not going out on
> the network.
>
> What is your worker structure? Are the 150 workers part of a few load
> balancers, or are they all in the worker.list? Can you show significant
> parts of your workers.properties? Are you using the non-default
> "pessimistic" locking?
>
> If your Apache uses a threaded APR library, you could try using
> JkWatchdogInterval. That moves the maintenance from the normal request
> processing code to a separate thread per Apache process.
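For illustration, a worker structure of the kind asked about might look like this (hypothetical worker names, hosts, and ports; worker.maintain shown at its 60-second default):

```properties
# workers.properties - hypothetical sketch: only the lb in worker.list
worker.list=mylb
worker.maintain=60

worker.mylb.type=lb
worker.mylb.balance_workers=my27,my28

worker.my27.type=ajp13
worker.my27.host=tomcat27.example.edu
worker.my27.port=62014

worker.my28.type=ajp13
worker.my28.host=tomcat28.example.edu
worker.my28.port=62014
```

With a threaded MPM, adding e.g. "JkWatchdogInterval 60" to httpd.conf moves this maintenance into one watchdog thread per Apache process.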

Two more things:

- are you running with a high mod_jk log level, e.g. JkLogLevel debug or 
trace?

- using the worker or prefork MPM in combination with JkWatchdogInterval 
would reduce the amount of shm locking due to global maintenance calls, 
because it is done per process, not per thread. You might have other 
reasons for using prefork though. Have you?

Regards,

Rainer




Re: mod_jk causing stuck Apache processes

Posted by Rainer Jung <ra...@kippdata.de>.
On 03.03.2015 at 23:11, Jesse Defer wrote:
>>>> any update on the behavior of the patched nodes? Did it behave better,
>> ie.
>>>> did the problem reoccur on other nodes but not on the patched ones,
>>>> or were all nodes mostly without problems during the last 2 days? Any
>>>> strange things on the patched ones? I'd like to add the patch for
>>>> 1.2.41 but your feedback would be very valuable.
>>>>
>>>> Thanks!
>>>>
>>>> Rainer
>>>
>>> It's been quiet, no stuck processes anywhere in the farm.  The patched
>>> jk is working fine, no issues found so far.  I'll continue to monitor
>> and let you know.
>>
>> Just to make sure: "no stuck processes anywhere in the farm" means we
>> can't be sure, that the patch helps, because the unpatched nodes didn't
>> have a problem either during that time. So we have to wait until the
>> problem happens a few more times so we can see whether the patch helps, or
>> the patched nodes have run long enough so that at least we don't expect a
>> negative impact from the patch.
>
> Rainer,
>
> The patch has not completely fixed the issue; I had a host today exhibit
> the behavior.  It does seem to behave better: stuck processes don't hang
> around as long, though there are still some >300 second ones.  It also seemed
> to recover by itself, though it took about 40 minutes.  It also had more
> idle processes on average, though it did hit 90% in W for a few minutes.
>
> I have also converted 4 (different) hosts to the event mpm and was tweaking
> the mpm settings all last week.  I think they are in a good state now and
> so far I haven't seen the issue on them.
>
> I now have 8 on stock mod_jk/prefork, 4 on your patched mod_jk/prefork, and 4
> on stock mod_jk/event.
>
> We are upgrading to httpd 2.4.12 on Friday and I am also gone on vacation
> all next week.  We haven't decided yet if we are switching to event at the
> same time, but it seems likely.  I am willing to continue testing your
> patch but I won't be able to resume until the week of the 16th and it may
> not be under prefork.

Thanks for your efforts.

I have committed the changes to mod_jk in 1665454. They will be part of 
version 1.2.41.

Regards,

Rainer



RE: mod_jk causing stuck Apache processes

Posted by Jesse Defer <Je...@asu.edu>.
> >> any update on the behavior of the patched nodes? Did it behave better,
> ie.
> >> did the problem reoccur on other nodes but not on the patched ones,
> >> or were all nodes mostly without problems during the last 2 days? Any
> >> strange things on the patched ones? I'd like to add the patch for
> >> 1.2.41 but your feedback would be very valuable.
> >>
> >> Thanks!
> >>
> >> Rainer
> >
> > It's been quiet, no stuck processes anywhere in the farm.  The patched
> > jk is working fine, no issues found so far.  I'll continue to monitor
> and let you know.
> 
> Just to make sure: "no stuck processes anywhere in the farm" means we
> can't be sure, that the patch helps, because the unpatched nodes didn't
> have a problem either during that time. So we have to wait until the
> problem happens a few more times so we can see whether the patch helps, or
> the patched nodes have run long enough so that at least we don't expect a
> negative impact from the patch.

Rainer,

The patch has not completely fixed the issue; I had a host today exhibit
the behavior.  It does seem to behave better: stuck processes don't hang
around as long, though there are still some >300 second ones.  It also seemed
to recover by itself, though it took about 40 minutes.  It also had more
idle processes on average, though it did hit 90% in W for a few minutes.

I have also converted 4 (different) hosts to the event mpm and was tweaking 
the mpm settings all last week.  I think they are in a good state now and
so far I haven't seen the issue on them.

I now have 8 on stock mod_jk/prefork, 4 on your patched mod_jk/prefork, and 4 
on stock mod_jk/event.

We are upgrading to httpd 2.4.12 on Friday and I am also gone on vacation 
all next week.  We haven't decided yet if we are switching to event at the 
same time, but it seems likely.  I am willing to continue testing your 
patch but I won't be able to resume until the week of the 16th and it may 
not be under prefork.

Respectfully,

Jesse

Re: mod_jk causing stuck Apache processes

Posted by Rainer Jung <ra...@kippdata.de>.
On 27.02.2015 at 18:15, Jesse Defer wrote:
>> Hi Jesse,
>>
>> any update on the behavior of the patched nodes? Did it behave better, ie.
>> did the problem reoccur on other nodes but not on the patched ones, or
>> were all nodes mostly without problems during the last 2 days? Any strange
>> things on the patched ones? I'd like to add the patch for 1.2.41 but your
>> feedback would be very valuable.
>>
>> Thanks!
>>
>> Rainer
>
> It's been quiet, no stuck processes anywhere in the farm.  The patched jk is
> working fine, no issues found so far.  I'll continue to monitor and let you know.

Just to make sure: "no stuck processes anywhere in the farm" means we
can't be sure that the patch helps, because the unpatched nodes didn't
have a problem either during that time. So we have to wait until the
problem happens a few more times so we can see whether the patch helps,
or until the patched nodes have run long enough that at least we don't
expect a negative impact from the patch.

Regards,

Rainer






RE: mod_jk causing stuck Apache processes

Posted by Jesse Defer <Je...@asu.edu>.
> Hi Jesse,
> 
> any update on the behavior of the patched nodes? Did it behave better, ie.
> did the problem reoccur on other nodes but not on the patched ones, or
> were all nodes mostly without problems during the last 2 days? Any strange
> things on the patched ones? I'd like to add the patch for 1.2.41 but your
> feedback would be very valuable.
> 
> Thanks!
> 
> Rainer

It's been quiet, no stuck processes anywhere in the farm.  The patched jk is 
working fine, no issues found so far.  I'll continue to monitor and let you know.

Respectfully,

Jesse



Re: mod_jk causing stuck Apache processes

Posted by Rainer Jung <ra...@kippdata.de>.
Hi Jesse,

any update on the behavior of the patched nodes? Did it behave better, 
ie. did the problem reoccur on other nodes but not on the patched ones, 
or were all nodes mostly without problems during the last 2 days? Any 
strange things on the patched ones? I'd like to add the patch for 1.2.41 
but your feedback would be very valuable.

Thanks!

Rainer



RE: mod_jk causing stuck Apache processes

Posted by Jesse Defer <Je...@asu.edu>.
> -----Original Message-----
> From: Rainer Jung [mailto:rainer.jung@kippdata.de]
> Sent: Wednesday, February 25, 2015 3:41 AM
> To: Tomcat Users List
> Subject: Re: mod_jk causing stuck Apache processes
> 
> So here is a patch for version 1.2.40:
> 
> http://people.apache.org/~rjung/patches/jk_shm_lock.patch
> 
> The patch is non-trivial though. It reduces the load on the lock by
> 
> - moving some trivial stuff done for each lb member in a separate
> lock/unlock block (per member) to the lb itself, using its own lock/unlock
> block. That doesn't reduce the total time locks are held, but the number
> of lock/unlock ops.
> 
> - using a different mechanism to check, whether the global maintenance
> should actually run (based on atomics and a global decision instead of a
> per lb decision). That should be more efficient and again reduces
> lock/unlock ops, because the check can now be done without the lock and
> only if the global maintenance has to run is the locking needed.
> Introduces atomics though.
> 
> You would have to rebuild mod_jk completely, not only the changed files,
> because of header file changes. So do a "make clean" before the "make".
> 
> There are no configuration changes needed. You could run one of your
> Apaches with the patched mod_jk and compare behavior. Give it a try on a
> test system first please. I haven't done a thorough testing myself yet.
> 
> I plan to include this in the next version (1.2.41), which is not far
> away. Testing and checking whether it helps would be really helpful.
> 
> Regards and thanks
> 
> Rainer

I've put the patched mod_jk on 4 of our 16 servers and it's working well
so far.  I will continue to monitor and let you know.

Respectfully,

Jesse



Re: mod_jk causing stuck Apache processes

Posted by Rainer Jung <ra...@kippdata.de>.
Hi Jesse,

On 24.02.2015 at 19:27, Jesse Defer wrote:
>> -----Original Message-----
>> From: Rainer Jung [mailto:rainer.jung@kippdata.de]
>> Sent: Monday, February 23, 2015 5:00 PM
>> To: Tomcat Users List
>> Subject: Re: mod_jk causing stuck Apache processes
>>
>> On 24.02.2015 at 00:29, Jesse Defer wrote:

>> I suggest:
>>
>> - check whether there's a performance problem on the webapp side when it
>> happens next time
>>     - check thread dumps
>>     - check GC log
>
> No long pauses that I can see.
>
>>     - activate Tomcat access log and add "%D" to the pattern, so that you
>> can analyze request duration on the Tomcat side post-mortem
>>     - using the request duration logs, you can also check, whether a
>> potential performance problem happens on all nodes or only on some.
>> Depending on the result, the GC log check and thread dump creation and
>> check would be easier or harder (more work).
>>
>> - run strace with time stamps plus system call duration next time the
>> problem occurs
>
> http://pastebin.com/raw.php?i=X09ZTW4V
>
>> - also add %D to the LogFormat of your Apache access log. Do also add the
>> name of the load balancer member that handled the request to the access
>> log. Mod_jk provides "notes", that can be logged in the access log (look
>> for "JK_LB_LAST_NAME" in
>> http://tomcat.apache.org/connectors-doc/reference/apache.html)
>>
>> - prepare switching to a threaded MPM, be it worker or event. Use
>> JkWatchdogInterval there.
>>
>> - put JkShmFile to a local file system
>>
>> - check whether your system clock on the Apache servers works correctly
>> and doesn't jump (common virtualization problem)
>
> NTP is configured on all hosts and we monitor drift with collectd every 10
> seconds, no anomalies.
>
>> - configure the jk status worker and check, whether many of your workers
>> are in ERROR state. You can also retrieve info from the JK status worker
>> in a simple to parse text format and automate retrieval and extraction of
>> data (similar to what one can do with the Apache server-status)
>
> I haven't seen the status worker show any workers in error.
>
> Here are some logs with request duration.  First entry in each pair is httpd,
> second is tomcat.  It is interesting that the tomcat response is very fast
> but in two of the cases the apache response is very slow.  I wasn't able to
> correlate the JVM stack traces for these, maybe because they executed quickly
> and weren't blocked.
>
> 10.142.64.251 - - [24/Feb/2015:10:34:37 -0700] "GET /my/ HTTP/1.1" 200 72209 "-" "Mozilla/5.0" pid=25649 tid=47017797431248 time_s=1 time_us=1174645 x_id=VOy2LYHbdfEAAGQxXQkAAAC0 content_type="text/html" last_modified="-"
> cache_control="no-cache" worker=mylb first_name=my27 first_busy=0 last_name=my27 last_busy=0 ssl_protocol=TLSv1 ssl_cipher=AES256-SHA
> 10.142.64.251 - - [24/Feb/2015:10:34:37 -0700] "GET /my/ HTTP/1.1" 200 72209 "-" "Mozilla/5.0" time_ms=115 tomcat_thread=ajp-62014-31 x_id=VOy2LYHbdfEAAGQxXQkAAAC0 app=my
>
> 10.142.64.251 - - [24/Feb/2015:10:31:02 -0700] "GET /my/ HTTP/1.1" 200 8184 "-" "Mozilla/5.0" pid=25697 tid=47017797431248 time_s=246 time_us=246422721 x_id=VOy1VoHbdfEAAGRhuqQAAADq content_type="text/html" last_modified="-" cache_control="-" worker=mylb first_name=my27 first_busy=0 last_name=my27 last_busy=0 ssl_protocol=TLSv1 ssl_cipher=AES256-SHA
> 10.142.64.251 - - [24/Feb/2015:10:35:09 -0700] "GET /my/ HTTP/1.1" 200 72209 "-" "Mozilla/5.0" time_ms=53 tomcat_thread=ajp-62014-32 x_id=VOy1VoHbdfEAAGRhuqQAAADq app=my
>
> 10.142.64.251 - - [24/Feb/2015:10:30:41 -0700] "GET /my/ HTTP/1.1" 200 8184 "-" "Mozilla/5.0" pid=25772 tid=47017797431248 time_s=371 time_us=371698757 x_id=VOy1QYHbdfEAAGSsv-oAAAEy content_type="text/html" last_modified="-" cache_control="-" worker=mylb first_name=my27 first_busy=0 last_name=my27 last_busy=0 ssl_protocol=TLSv1 ssl_cipher=AES256-SHA
> 10.142.64.251 - - [24/Feb/2015:10:36:52 -0700] "GET /my/ HTTP/1.1" 200 72209 "-" "Mozilla/5.0" time_ms=57 tomcat_thread=ajp-62014-5 x_id=VOy1QYHbdfEAAGSsv-oAAAEy app=my
>
> I let one server stay busy for about an hour to collect some logs.
> There are some slow requests for this app in the tomcat logs, but few over
> 60 seconds and only 1 over 120 seconds (for all 17 of this app's tomcats).
> Meanwhile there are thousands of httpd requests over 120s from just the
> one web server with the stuck processes.  Plotting the request time for
> the entire httpd farm shows the affected server has a large jump but
> no change in the other servers.

So here is a patch for version 1.2.40:

http://people.apache.org/~rjung/patches/jk_shm_lock.patch

The patch is non-trivial though. It reduces the load on the lock by

- moving some trivial stuff done for each lb member in a separate 
lock/unlock block (per member) to the lb itself, using its own 
lock/unlock block. That doesn't reduce the total time locks are held, 
but the number of lock/unlock ops.

- using a different mechanism to check whether the global maintenance 
should actually run (based on atomics and a global decision instead of a 
per-lb decision). That should be more efficient and again reduces 
lock/unlock ops, because the check can now be done without the lock, and 
the locking is only needed if the global maintenance has to run. 
It introduces atomics though.
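The second point can be sketched as follows: keep one global "next maintenance due" timestamp, let callers check it without any lock, and let exactly one caller win an atomic compare-and-swap to advance it and run the maintenance. A Python illustration of the idea (mod_jk does this in C with APR atomics; all names here are made up):

```python
import threading

MAINTAIN_INTERVAL = 60.0  # seconds, like worker.maintain

_next_due = 0.0
_cas_guard = threading.Lock()  # stands in for a hardware CAS instruction

def _compare_and_swap(expected: float, new: float) -> bool:
    """Emulate an atomic compare-and-swap on the global due-time."""
    global _next_due
    with _cas_guard:
        if _next_due == expected:
            _next_due = new
            return True
        return False

def should_run_maintenance(now: float) -> bool:
    """Lock-free fast path: most callers just read the timestamp, see that
    maintenance is not due yet, and never touch the shm lock at all."""
    due = _next_due
    if now < due:
        return False
    # Maintenance is due; exactly one concurrent caller wins the CAS
    # and is the only one that then takes the shm lock to do the work.
    return _compare_and_swap(due, now + MAINTAIN_INTERVAL)

first = should_run_maintenance(100.0)   # wins, schedules the next run at t=160
second = should_run_maintenance(100.0)  # loses: maintenance is no longer due
```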

You would have to rebuild mod_jk completely, not only the changed files, 
because of header file changes. So do a "make clean" before the "make".

There are no configuration changes needed. You could run one of your 
Apaches with the patched mod_jk and compare behavior. Give it a try on a 
test system first please. I haven't done a thorough testing myself yet.

I plan to include this in the next version (1.2.41), which is not far 
away. Testing and checking whether it helps would be really helpful.

Regards and thanks

Rainer




RE: mod_jk causing stuck Apache processes

Posted by Jesse Defer <Je...@asu.edu>.

> -----Original Message-----
> From: Rainer Jung [mailto:rainer.jung@kippdata.de]
> Sent: Monday, February 23, 2015 5:00 PM
> To: Tomcat Users List
> Subject: Re: mod_jk causing stuck Apache processes
> 
> On 24.02.2015 at 00:29, Jesse Defer wrote:
> >> -----Original Message-----
> >> On 23.02.2015 at 22:53, Jesse Defer wrote:
> >>>> -----Original Message-----
> >>>> On 23.02.2015 at 19:03, Jesse Defer wrote:
> 
> ...
> 
> >> How many Apache web servers sit in front of the Tomcats? Is each of
> >> the Apaches configured with 512 processes?
> >
> > All Tomcats are configured the same, maxThreads is 4096, apr, Tomcat 6.
> > 16 apaches, all configured identically with 512 processes, the same
> > worker.properties, etc.
> 
> OK. So fewer threads than 16*512. But the apr connector - like the nio one
> - can handle more connections than it has threads. Not more in-flight
> requests, but for both connectors idle connections (keep-alive
> connections) are cheap and do not need a pool thread.
> 
> > Our busiest Tomcats peak at about 2000 connections to their AJP port
> > from the web farm.  I'm not sure how closely connections relate to
> > thread usage, I haven't looked too closely at the JVM yet.
> 
> For BIO one connection one thread. For APR and NIO it's roughly one in-
> flight request one thread. Connections waiting for the next request to
> arrive are cheap and handled by few separate poller threads.
> 
> ...
> 
> >> And only the lb is added to the worker.list? Or also the individual
> >> AJP workers?
> >
> > Just the LBs.
> 
> OK
> 
> >>>> If your Apache uses a threaded APR library, you could try using
> >>>> JkWatchdogInterval. That moved the maintenance from the normal
> >>>> request processing code to a separate thread per Apache process.
> >>
> >> I suspect that once your app (one or some of your apps) gets slow,
> >> requests pile up in Apache and Apache can no longer connect to
> >> Tomcat, because it doesn't have free threads. Some Apache children
> >> still can connect, because the connect is handled by the OS
> >> underneath Tomcat, but those connections do not get the awaited cping
> >> reply.
> >
> > Shouldn't the ping timeout cause the processes to exit the wait state
> > if the cping reply hasn't been received?  I have seen the SS column in
> > server-status grow into the thousands.
> 
> Yes it would. But it could be a mixture: lots of Apache processes waiting
> for hanging or slow Tomcat responses due to an application or GC
> bottleneck, and some on-top requests failing in cping.
> 
> >> In addition to the Thread Dumps, you should also activate GC-Logs for
> >> your Tomcat JVMs.
> >
> > I have gc logging enabled already.
> 
> ACK
> 
> >>>> - using the worker or prefork MPM in combination with
> >>>> JkWatchdogInterval would reduce the amount of shm locking due to
> >>>> global maintenance calls, because it is done per process, not per
> >>>> thread. You might have other reasons for using prefork though. Have
> >> you?
> >>>
> >>> Use of prefork is historical.  We have been looking at going to
> >>> event, but haven't made the switch yet.
> >>
> >> If you are using many Apaches each with 512 children, you will most
> >> likely need to use an NIO connector on the Tomcat processes and would
> >> also benefit from worker or event MPM for Apache. Each process e.g.
> >> with
> >> 50 or 100 threads and a relatively short connection_pool_timeout like
> e.g.
> >> 60 seconds.
> >
> > I am going to bring up accelerating the switch to the event MPM with
> > my team.  Will NIO give us more scalability than APR?
> 
> Not really, both are good at connection scalability. Maybe don't change too
> much at the same time. I'd say NIO is a bit more mainstream than APR, but
> your mileage may vary.
> 
> >> What was the reason, that you started suspecting the fcntl()
> >> underneath jk_shm_lock? Are we sure, that the stack you provided is
> >> really the typical one? And was the strace dumping all system calls
> >> or only the fcntl calls?
> >
> > I can't be sure it's the typical one; I only know that every time I sample
> > the stack of a stuck process that's what I get, plus the strace shows
> > no other system calls except time().  It seems like it's stuck in a
> > fairly tight loop somewhere.
> > I haven't looked too deeply in the mod_jk code though and haven't
> > collected any system call statistics.
> >
> > I don't necessarily suspect fcntl, but it's getting called a lot (the
> > strace is not filtered).  System time jumps when the issue is
> > happening and the fcntl looks like the vast majority of system calls.
> > Since the issue got worse when we added more processes it seems
> > plausible it could be lock contention.
> 
> Hmm, but lock contention should result in the fcntl call taking more and
> more time. Could it be a problem with the system time? That somehow the
> "don't run the maintenance for the next 60 seconds" rule doesn't take
> effect because the clock jumps?
> 
> I suggest:
> 
> - check whether there's a performance problem on the webapp side when it
> happens next time
>    - check thread dumps
>    - check GC log

No long pauses that I can see.

>    - activate Tomcat access log and add "%D" to the pattern, so that you
> can analyze request duration on the Tomcat side post-mortem
>    - using the request duration logs, you can also check, whether a
> potential performance problem happens on all nodes or only on some.
> Depending on the result, the GC log check and thread dump creation and
> check would be easier or harder (more work).
> 
> - run strace with time stamps plus system call duration next time the
> problem occurs

http://pastebin.com/raw.php?i=X09ZTW4V

> - also add %D to the LogFormat of your Apache access log. Do also add the
> name of the load balancer member that handled the request to the access
> log. Mod_jk provides "notes", that can be logged in the access log (look
> for "JK_LB_LAST_NAME" in
> http://tomcat.apache.org/connectors-doc/reference/apache.html)
> 
> - prepare switching to a threaded MPM, be it worker or event. Use
> JkWatchdogInterval there.
> 
> - put JkShmFile to a local file system
> 
> - check whether your system clock on the Apache servers works correctly
> and doesn't jump (common virtualization problem)

NTP is configured on all hosts and we monitor drift with collectd every 10
seconds, no anomalies.

> - configure the jk status worker and check, whether many of your workers
> are in ERROR state. You can also retrieve info from the JK status worker
> in a simple to parse text format and automate retrieval and extraction of
> data (similar to what one can do with the Apache server-status)

I haven't seen the status worker show any workers in error.

> Regards,
> 
> Rainer

Here are some logs with request duration.  First entry in each pair is httpd, 
second is tomcat.  It is interesting that the tomcat response is very fast 
but in two of the cases the apache response is very slow.  I wasn't able to 
correlate the JVM stack traces for these, maybe because they executed quickly 
and weren't blocked.

10.142.64.251 - - [24/Feb/2015:10:34:37 -0700] "GET /my/ HTTP/1.1" 200 72209 "-" "Mozilla/5.0" pid=25649 tid=47017797431248 time_s=1 time_us=1174645 x_id=VOy2LYHbdfEAAGQxXQkAAAC0 content_type="text/html" last_modified="-" 
cache_control="no-cache" worker=mylb first_name=my27 first_busy=0 last_name=my27 last_busy=0 ssl_protocol=TLSv1 ssl_cipher=AES256-SHA
10.142.64.251 - - [24/Feb/2015:10:34:37 -0700] "GET /my/ HTTP/1.1" 200 72209 "-" "Mozilla/5.0" time_ms=115 tomcat_thread=ajp-62014-31 x_id=VOy2LYHbdfEAAGQxXQkAAAC0 app=my

10.142.64.251 - - [24/Feb/2015:10:31:02 -0700] "GET /my/ HTTP/1.1" 200 8184 "-" "Mozilla/5.0" pid=25697 tid=47017797431248 time_s=246 time_us=246422721 x_id=VOy1VoHbdfEAAGRhuqQAAADq content_type="text/html" last_modified="-" cache_control="-" worker=mylb first_name=my27 first_busy=0 last_name=my27 last_busy=0 ssl_protocol=TLSv1 ssl_cipher=AES256-SHA
10.142.64.251 - - [24/Feb/2015:10:35:09 -0700] "GET /my/ HTTP/1.1" 200 72209 "-" "Mozilla/5.0" time_ms=53 tomcat_thread=ajp-62014-32 x_id=VOy1VoHbdfEAAGRhuqQAAADq app=my

10.142.64.251 - - [24/Feb/2015:10:30:41 -0700] "GET /my/ HTTP/1.1" 200 8184 "-" "Mozilla/5.0" pid=25772 tid=47017797431248 time_s=371 time_us=371698757 x_id=VOy1QYHbdfEAAGSsv-oAAAEy content_type="text/html" last_modified="-" cache_control="-" worker=mylb first_name=my27 first_busy=0 last_name=my27 last_busy=0 ssl_protocol=TLSv1 ssl_cipher=AES256-SHA
10.142.64.251 - - [24/Feb/2015:10:36:52 -0700] "GET /my/ HTTP/1.1" 200 72209 "-" "Mozilla/5.0" time_ms=57 tomcat_thread=ajp-62014-5 x_id=VOy1QYHbdfEAAGSsv-oAAAEy app=my

I let one server stay busy for about an hour to collect some logs.
There are some slow requests for this app in the tomcat logs, but few over
60 seconds and only 1 over 120 seconds (for all 17 of this app's tomcats).  
Meanwhile there are thousands of httpd requests over 120s from just the 
one web server with the stuck processes.  Plotting the request time for
the entire httpd farm shows a large jump for the affected server but
no change in the other servers.

Respectfully,

Jesse

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: mod_jk causing stuck Apache processes

Posted by Rainer Jung <ra...@kippdata.de>.
Am 24.02.2015 um 00:29 schrieb Jesse Defer:
>> -----Original Message-----
>> Am 23.02.2015 um 22:53 schrieb Jesse Defer:
>>>> -----Original Message-----
>>>> Am 23.02.2015 um 19:03 schrieb Jesse Defer:

...

>> How many Apache web servers sit in front of the Tomcats? Is each of the
>> Apaches configured with 512 processes?
>
> All Tomcats are configured the same, maxThreads is 4096, apr, Tomcat 6.
> 16 apaches, all configured identically with 512 processes, the same
> worker.properties, etc.

OK. So fewer threads than 16*512. But the apr connector - like the nio 
one - can handle more connections than it has threads. Not more 
in-flight requests, but for both connectors idle (keep-alive) 
connections are cheap and do not need a pool thread.

> Our busiest Tomcats peak at about 2000 connections to their AJP port from the
> web farm.  I'm not sure how closely connections relate to thread usage, I
> haven't looked too closely at the JVM yet.

For BIO it's one thread per connection. For APR and NIO it's roughly one 
thread per in-flight request. Connections waiting for the next request 
to arrive are cheap and are handled by a few separate poller threads.

...

>> And only the lb is added to the worker.list? Or also the individual AJP
>> workers?
>
> Just the LBs.

OK

>>>> If your Apache uses a threaded APR library, you could try using
>>>> JkWatchdogInterval. That moved the maintenance from the normal
>>>> request processing code to a separate thread per Apache process.
>>
>> I suspect that once your app (one or some of your apps) gets slow,
>> requests pile up in Apache and Apache can no longer connect to Tomcat,
>> because it doesn't have free threads. Some Apache children still can
>> connect, because the connect is handled by the OS underneath Tomcat, but
>> those connections do not get the awaited cping reply.
>
> Shouldn't the ping timeout cause the processes to exit the wait state if
> the cping reply hasn't been received?  I have seen the SS column in
> server-status grow into the thousands.

Yes, it would. But it could be a mixture: lots of Apache processes 
waiting for hanging or slow Tomcat responses due to an application or GC 
bottleneck, and some of the requests on top of those failing in cping.

>> In addition to the Thread Dumps, you should also activate GC-Logs for your
>> Tomcat JVMs.
>
> I have gc logging enabled already.

ACK

>>>> - using the worker or prefork MPM in combination with
>>>> JkWatchdogInterval would reduce the amount of shm locking due to
>>>> global maintenance calls, because it is done per process, not per
>>>> thread. You might have other reasons for using prefork though. Have
>> you?
>>>
>>> Use of prefork is historical.  We have been looking at going to event,
>>> but haven't made the switch yet.
>>
>> If you are using many Apaches each with 512 children, you will most likely
>> need to use an NIO connector on the Tomcat processes and would also
>> benefit from worker or event MPM for Apache. Each process e.g. with
>> 50 or 100 threads and a relatively short connection_pool_timeout like e.g.
>> 60 seconds.
>
> I am going to bring up accelerating the switch to the event MPM with my
> team.  Will NIO give us more scalability than APR?

Not really, both are good at connection scalability. Maybe don't change 
too much at the same time. I'd say NIO is a bit more mainstream than 
APR, but your mileage may vary.

>> What was the reason, that you started suspecting the fcntl() underneath
>> jk_shm_lock? Are we sure, that the stack you provided is really the
>> typical one? And was the strace dumping all system calls or only the fcntl
>> calls?
>
> I can't be sure it's the typical one, I only know every time I sample the stack
> of a stuck process that's what I get, plus the strace shows no other system
> calls except time().  It seems like it's stuck in a fairly tight loop somewhere.
> I haven't looked too deeply in the mod_jk code though and haven't collected
> any system call statistics.
>
> I don't necessarily suspect fcntl, but it's getting called a lot (the strace
> is not filtered).  System time jumps when the issue is happening and the fcntl
> looks like the vast majority of system calls.  Since the issue got worse when
> we added more processes it seems plausible it could be lock contention.

Hmm, but lock contention should result in the fcntl call taking more 
and more time. Could it be a problem with the system time? That somehow 
the "don't run the maintenance for the next 60 seconds" rule doesn't 
take effect because the clock jumps?

I suggest:

- check whether there's a performance problem on the webapp side when it 
happens next time
   - check thread dumps
   - check GC log
   - activate Tomcat access log and add "%D" to the pattern, so that you 
can analyze request duration on the Tomcat side post-mortem
   - using the request duration logs, you can also check whether a 
potential performance problem happens on all nodes or only on some. 
Depending on the result, the GC log check and thread dump creation and 
analysis will be easier or harder (more work).
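
   A minimal sketch for the Tomcat side (goes inside the <Host> element 
in conf/server.xml; directory, prefix and pattern are examples for this 
setup - note that Tomcat's %D is milliseconds, while Apache's %D is 
microseconds):

     <Valve className="org.apache.catalina.valves.AccessLogValve"
            directory="logs" prefix="access_log." suffix=".txt"
            pattern="%h %l %u %t &quot;%r&quot; %s %b %D" />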

- run strace with time stamps plus system call duration next time the 
problem occurs
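
For example (untested sketch; substitute the PID of one stuck child):

   # -tt: wall-clock timestamps with microseconds, -T: time spent
   # inside each syscall, -f: follow forks
   strace -tt -T -f -o /tmp/httpd-strace.log -p <pid-of-stuck-child>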

- also add %D to the LogFormat of your Apache access log. Do also add 
the name of the load balancer member that handled the request to the 
access log. mod_jk provides "notes", that can be logged in the access 
log (look for "JK_LB_LAST_NAME" in 
http://tomcat.apache.org/connectors-doc/reference/apache.html)
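
In httpd.conf that could look like this sketch (format string and log 
name are examples; %{JK_LB_LAST_NAME}n logs the request note set by 
mod_jk, %D the duration in microseconds):

   LogFormat "%h %l %u %t \"%r\" %>s %b %D %{JK_LB_LAST_NAME}n" jklog
   CustomLog logs/access_log jklog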

- prepare switching to a threaded MPM, be it worker or event. Use 
JkWatchdogInterval there.
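
A sketch of what that could look like (all numbers illustrative and to 
be tuned for your farm - 16 processes x 32 threads keeps the 512 
workers, but shm maintenance then runs in one watchdog thread per 
process instead of per request):

   <IfModule mpm_event_module>
       ServerLimit         16
       ThreadsPerChild     32
       MaxRequestWorkers   512
   </IfModule>
   JkWatchdogInterval 60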

- put JkShmFile to a local file system

- check whether your system clock on the Apache servers works correctly 
and doesn't jump (common virtualization problem)

- configure the jk status worker and check, whether many of your workers 
are in ERROR state. You can also retrieve info from the JK status worker 
in a simple to parse text format and automate retrieval and extraction 
of data (similar to what one can do with the Apache server-status)

Regards,

Rainer



RE: mod_jk causing stuck Apache processes

Posted by Jesse Defer <Je...@asu.edu>.
> -----Original Message-----
> Am 23.02.2015 um 22:53 schrieb Jesse Defer:
> >> -----Original Message-----
> >> Am 23.02.2015 um 19:03 schrieb Jesse Defer:
> >>> I have a farm of Apache httpd servers proxying to Tomcat with mod_jk
> >>> and
> >> I am having issues with Apache processes getting stuck (as seen by
> >> the W state in server-status).  I am sending to this list because the
> >> stack traces show httpd gets stuck in mod_jk.
> >>>
> >>> httpd is configured for prefork and with 512 servers on start and
> >> maximum.  When the problem occurs we end up with nearly all 512
> >> processes in the W state until we restart it.  The problem occurs
> >> more often when load is high but is not restricted to high load.  The
> >> problem started occurring more often when we increased the servers
> >> from 384 to 512.  The hosts have enough memory and do not swap.  The
> >> issue occurs intermittently and is not tied to a particular host or
> >> Tomcat instance (there are ~150 Tomcat instances).  The JkShmFile is on
> tmpfs.
> >>
> >> Why on tmpfs?
> >
> > Not sure, might be because of poor local IO performance (hosts are
> > VMs) or historic use of network filesystems.
> 
> I haven't checked the details, but I don't expect the shared memory
> activity to be reflected by FS activity. I don't have an indication, that
> tmpfs is a problem for the locking either, but once we run out of ideas,
> you could try whether moving away from tmpfs and network filesystems for
> the JkShmFile and instead using a path to a local file for JkShmFile
> helps.
> 
> >>> Environment: RHEL5.11, Apache 2.4.10 (prefork), JK 1.2.40, APR
> >>> 1.5.1, APR-UTIL 1.5.4
> >>>
> >>> The stuck httpd processes all show the same stack and strace:
> >>>
> >>> pstack:
> >>> #0  0x00002b3439026bff in fcntl () from /lib64/libpthread.so.0
> >>> #1  0x00002b3440911656 in jk_shm_lock () from
> >>> /usr/local/apache2/modules/mod_jk.so
> >>> #2  0x00002b3440917805 in ajp_maintain () from
> >>> /usr/local/apache2/modules/mod_jk.so
> >>> #3  0x00002b3440906cac in maintain_workers () from
> >>> /usr/local/apache2/modules/mod_jk.so
> >>> #4  0x00002b3440901140 in wc_maintain () from
> >>> /usr/local/apache2/modules/mod_jk.so
> >>> #5  0x00002b34408f40b6 in jk_handler () from
> >>> /usr/local/apache2/modules/mod_jk.so
> >>> #6  0x0000000000448eca in ap_run_handler ()
> >>> #7  0x000000000044cc92 in ap_invoke_handler ()
> >>> #8  0x000000000045e24f in ap_process_async_request ()
> >>> #9  0x000000000045e38f in ap_process_request ()
> >>> #10 0x000000000045ab65 in ap_process_http_connection ()
> >>> #11 0x00000000004530ba in ap_run_process_connection ()
> >>> #12 0x000000000046423a in child_main ()
> >>> #13 0x0000000000464544 in make_child ()
> >>> #14 0x00000000004649ae in prefork_run ()
> >>> #15 0x0000000000430634 in ap_run_mpm ()
> >>> #16 0x000000000042ad97 in main ()
> >>
> >> So mod_jk tries to get a lock on the shared memory before reading and
> >> updating some shared memory data. That, as one of many things mod_jk
> >> does, is normal. It would not be normal if most processes seem to
> >> almost always sit in this stack.
> >>
> >>> strace:
> >>> fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1})
> >>> =
> >>> 0 fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0,
> >>> len=1}) =
> >> 0
> >>> time(NULL)                              = 1424711498
> >>> fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1})
> >>> =
> >>> 0 fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0,
> >>> len=1}) = 0 fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET,
> >>> start=0,
> >>> len=1}) = 0 fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET,
> >>> start=0, len=1}) = 0 fcntl(19, F_SETLKW, {type=F_WRLCK,
> >>> whence=SEEK_SET, start=0, len=1}) = 0 fcntl(19, F_SETLKW,
> >>> {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
> >>>
> >>> Any help tracking this issue down would be appreciated.
> >>
> >> This doesn't look like mod_jk is hanging in the jk_shm_lock() but
> >> instead it looks like normal processing, locking and then unlocking.
> >> Locking and unlocking succeeds (return code 0).
> >>
> >> You didn't provide time stamps to get an idea, whether it is normal
> >> behavior or not. What is your request throughput (requests per second
> >> forwarded by mod_jk as long as it is running well)?
> >
> > Next time it happens I will capture timestamps, but it scrolls pretty
> > quickly and doesn't seem to get hung up on any calls.  We have very
> > seasonal traffic, but normally around 30/sec.  The high period peaks at
> around 200/sec.
> 
> OK, strace should be able to print time stamps plus the time that the call
> took.
> 
> Load around 30/s up to 200/s is not extreme.
> 
> >> I suspect something else is wrong. Have you checked, whether in fact
> >> mod_jk waits for the response from requests it has sent to the
> >> backend and which do not return quickly, e.g. looking at a Java
> >> thread dump
> >> (not: heap dump) of the backend during the times you have problems?
> >
> > I haven't, next time it happens I'll look.
> >
> >> What are the error or warn messages in your mod_jk log file?
> >
> > Here is one instance of each type of warn/error in the logs of an
> affected host.
> > All of these are fairly common but we have tomcats going up and down
> > all the time and I think most are related to that.  Log level is info.
> >
> > [Mon Feb 23 03:05:37 2015] [25833:47348544151936] [error]
> ajp_service::jk_ajp_common.c (2693): (my27) connecting to tomcat failed.
> > [Mon Feb 23 03:05:39 2015] [25944:47348544151936] [error]
> > ajp_connect_to_endpoint::jk_ajp_common.c (1050): (ead21) cping/cpong
> > after connecting to the backend server failed (errno=110) [Mon Feb 23
> > 03:05:39 2015] [25944:47348544151936] [error]
> > ajp_send_request::jk_ajp_common.c (1663): (ead21) connecting to
> > backend failed. Tomcat is probably not started or is listening on the
> > wrong port (errno=110) [Mon Feb 23 12:59:18 2015]
> > [25927:47348544151936] [error] ajp_get_reply::jk_ajp_common.c (2176):
> > (dir16) Tomcat is down or refused connection. No response has been
> > sent to the client (yet) [Mon Feb 23 14:12:12 2015]
> > [26170:47348544151936] [warn] ajp_handle_cping_cpong::jk_ajp_common.c
> > (965): awaited reply cpong, received 3 instead. Closing connection
> 
> I don't expect those to be consequences of a mod_jk problem. More likely
> it is the other way round. How many threads do you have configured for
> your AJP connector? Which connector (bio, nio, apr) do you use for Tomcat?
> 
> How many Apache web servers sit in front of the Tomcats? Is each of the
> Apaches configured with 512 processes?

All Tomcats are configured the same, maxThreads is 4096, apr, Tomcat 6.  
16 apaches, all configured identically with 512 processes, the same 
worker.properties, etc.

Our busiest Tomcats peak at about 2000 connections to their AJP port from the 
web farm.  I'm not sure how closely connections relate to thread usage, I 
haven't looked too closely at the JVM yet.

> >> Did you use "apachectl graceful" shortly before the problems start,
> >> or change configuration via the mod_jk status worker?
> >
> > No.
> >
> >> What is special here, is the use of many processes plus many workers.
> >> So the lock is used quite a lot. Still the "global maintain"
> >> functionality which uses the jk_shm_lock() in the stack above should
> >> be called by each process only every worker.maintain seconds, by
> default every 60 seconds.
> >> And each process should quickly detect whether another process
> >> already did the global maintain. During the global maintain, for any
> >> ajp13 worker there are really just a few simple local code statements.
> >> For any lb (load
> >> balancer) worker, there is a little more to do, especially checking
> >> whether any members have failed long ago and should now be recovered.
> >> But still those operations are all local, not going out on the network.
> >>
> >> What is your worker structure? Are the 150 workers part of a few load
> >> balancers or are they all in the worker.list? Can you show
> >> significant parts of your workers.properties? Are you using the
> >> non-default "pessimistic" locking?
> >
> > I misspoke, we have ~150 applications so that would be the load
> > balancer count.  Total worker count between all the LBs would be ~700.
> > We are using the default locking.  Here is one LB:
> >
> > # 109t
> > worker.109t1.port=62037
> > worker.109t1.host=java1
> > worker.109t1.type=ajp13
> > worker.109t1.lbfactor=1
> > worker.109t1.connection_pool_timeout=539
> > worker.109t1.socket_keepalive=true
> > worker.109t1.ping_mode=CP
> > worker.109t1.ping_timeout=5000
> > worker.109t1.redirect=109t2
> >
> > worker.109t2.reference=worker.109t1
> > worker.109t2.host=java2
> > worker.109t2.redirect=109t3
> >
> > worker.109t3.reference=worker.109t1
> > worker.109t3.host=java3
> > worker.109t3.redirect=109t4
> >
> > worker.109t4.reference=worker.109t1
> > worker.109t4.host=java4
> > worker.109t4.redirect=109t1
> >
> > worker.109tlb.type=lb
> > worker.109tlb.balance_workers=109t1,109t2,109t3,109t4
> > worker.109tlb.sticky_session=true
> 
> And only the lb is added to the worker.list? Or also the individual AJP
> workers?

Just the LBs.

> >> If your Apache uses a threaded APR library, you could try using
> >> JkWatchdogInterval. That moved the maintenance from the normal
> >> request processing code to a separate thread per Apache process.
> 
> I suspect that once your app (one or some of your apps) gets slow,
> requests pile up in Apache and Apache can no longer connect to Tomcat,
> because it doesn't have free threads. Some Apache children still can
> connect, because the connect is handled by the OS underneath Tomcat, but
> those connections do not get the awaited cping reply.

Shouldn't the ping timeout cause the processes to exit the wait state if
the cping reply hasn't been received?  I have seen the SS column in 
server-status grow into the thousands.

> In addition to the Thread Dumps, you should also activate GC-Logs for your
> Tomcat JVMs.

I have gc logging enabled already.
 
> >> - using the worker or prefork MPM in combination with
> >> JkWatchdogInterval would reduce the amount of shm locking due to
> >> global maintenance calls, because it is done per process, not per
> >> thread. You might have other reasons for using prefork though. Have
> you?
> >
> > Use of prefork is historical.  We have been looking at going to event,
> > but haven't made the switch yet.
> 
> If you are using many Apaches each with 512 children, you will most likely
> need to use an NIO connector on the Tomcat processes and would also
> benefit from worker or event MPM for Apache. Each process e.g. with
> 50 or 100 threads and a relatively short connection_pool_timeout like e.g.
> 60 seconds.

I am going to bring up accelerating the switch to the event MPM with my
team.  Will NIO give us more scalability than APR?

> What was the reason, that you started suspecting the fcntl() underneath
> jk_shm_lock? Are we sure, that the stack you provided is really the
> typical one? And was the strace dumping all system calls or only the fcntl
> calls?

I can't be sure it's the typical one, I only know every time I sample the stack 
of a stuck process that's what I get, plus the strace shows no other system 
calls except time().  It seems like it's stuck in a fairly tight loop somewhere.  
I haven't looked too deeply in the mod_jk code though and haven't collected
any system call statistics.

I don't necessarily suspect fcntl, but it's getting called a lot (the strace
is not filtered).  System time jumps when the issue is happening and the fcntl
looks like the vast majority of system calls.  Since the issue got worse when 
we added more processes it seems plausible it could be lock contention.
 
> Regards,
> 
> Rainer

Respectfully,

Jesse

Re: mod_jk causing stuck Apache processes

Posted by Rainer Jung <ra...@kippdata.de>.
Am 23.02.2015 um 22:53 schrieb Jesse Defer:
>> -----Original Message-----
>> Am 23.02.2015 um 19:03 schrieb Jesse Defer:
>>> I have a farm of Apache httpd servers proxying to Tomcat with mod_jk and
>> I am having issues with Apache processes getting stuck (as seen by the W
>> state in server-status).  I am sending to this list because the stack
>> traces show httpd gets stuck in mod_jk.
>>>
>>> httpd is configured for prefork and with 512 servers on start and
>> maximum.  When the problem occurs we end up with nearly all 512 processes
>> in the W state until we restart it.  The problem occurs more often when
>> load is high but is not restricted to high load.  The problem started
>> occurring more often when we increased the servers from 384 to 512.  The
>> hosts have enough memory and do not swap.  The issue occurs intermittently
>> and is not tied to a particular host or Tomcat instance (there are ~150
>> Tomcat instances).  The JkShmFile is on tmpfs.
>>
>> Why on tmpfs?
>
> Not sure, might be because of poor local IO performance (hosts are VMs) or
> historic use of network filesystems.

I haven't checked the details, but I don't expect the shared memory 
activity to be reflected by FS activity. I don't have an indication, 
that tmpfs is a problem for the locking either, but once we run out of 
ideas, you could try whether moving away from tmpfs and network 
filesystems for the JkShmFile and instead using a path to a local file 
for JkShmFile helps.

>>> Environment: RHEL5.11, Apache 2.4.10 (prefork), JK 1.2.40, APR 1.5.1,
>>> APR-UTIL 1.5.4
>>>
>>> The stuck httpd processes all show the same stack and strace:
>>>
>>> pstack:
>>> #0  0x00002b3439026bff in fcntl () from /lib64/libpthread.so.0
>>> #1  0x00002b3440911656 in jk_shm_lock () from
>>> /usr/local/apache2/modules/mod_jk.so
>>> #2  0x00002b3440917805 in ajp_maintain () from
>>> /usr/local/apache2/modules/mod_jk.so
>>> #3  0x00002b3440906cac in maintain_workers () from
>>> /usr/local/apache2/modules/mod_jk.so
>>> #4  0x00002b3440901140 in wc_maintain () from
>>> /usr/local/apache2/modules/mod_jk.so
>>> #5  0x00002b34408f40b6 in jk_handler () from
>>> /usr/local/apache2/modules/mod_jk.so
>>> #6  0x0000000000448eca in ap_run_handler ()
>>> #7  0x000000000044cc92 in ap_invoke_handler ()
>>> #8  0x000000000045e24f in ap_process_async_request ()
>>> #9  0x000000000045e38f in ap_process_request ()
>>> #10 0x000000000045ab65 in ap_process_http_connection ()
>>> #11 0x00000000004530ba in ap_run_process_connection ()
>>> #12 0x000000000046423a in child_main ()
>>> #13 0x0000000000464544 in make_child ()
>>> #14 0x00000000004649ae in prefork_run ()
>>> #15 0x0000000000430634 in ap_run_mpm ()
>>> #16 0x000000000042ad97 in main ()
>>
>> So mod_jk tries to get a lock on the shared memory before reading and
>> updating some shared memory data. That, as one of many things mod_jk does,
>> is normal. It would not be normal if most processes seem to almost always
>> sit in this stack.
>>
>>> strace:
>>> fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) =
>>> 0 fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) =
>> 0
>>> time(NULL)                              = 1424711498
>>> fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) =
>>> 0 fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1})
>>> = 0 fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0,
>>> len=1}) = 0 fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET,
>>> start=0, len=1}) = 0 fcntl(19, F_SETLKW, {type=F_WRLCK,
>>> whence=SEEK_SET, start=0, len=1}) = 0 fcntl(19, F_SETLKW,
>>> {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
>>>
>>> Any help tracking this issue down would be appreciated.
>>
>> This doesn't look like mod_jk is hanging in the jk_shm_lock() but instead
>> it looks like normal processing, locking and then unlocking.
>> Locking and unlocking succeeds (return code 0).
>>
>> You didn't provide time stamps to get an idea, whether it is normal
>> behavior or not. What is your request throughput (requests per second
>> forwarded by mod_jk as long as it is running well)?
>
> Next time it happens I will capture timestamps, but it scrolls pretty
> quickly and doesn't seem to get hung up on any calls.  We have very seasonal
> traffic, but normally around 30/sec.  The high period peaks at around 200/sec.

OK, strace should be able to print time stamps plus the time that the 
call took.

Load around 30/s up to 200/s is not extreme.

>> I suspect something else is wrong. Have you checked, whether in fact
>> mod_jk waits for the response from requests it has sent to the backend and
>> which do not return quickly, e.g. looking at a Java thread dump
>> (not: heap dump) of the backend during the times you have problems?
>
> I haven't, next time it happens I'll look.
>
>> What are the error or warn messages in your mod_jk log file?
>
> Here is one instance of each type of warn/error in the logs of an affected host.
> All of these are fairly common but we have tomcats going up and down all the
> time and I think most are related to that.  Log level is info.
>
> [Mon Feb 23 03:05:37 2015] [25833:47348544151936] [error] ajp_service::jk_ajp_common.c (2693): (my27) connecting to tomcat failed.
> [Mon Feb 23 03:05:39 2015] [25944:47348544151936] [error] ajp_connect_to_endpoint::jk_ajp_common.c (1050): (ead21) cping/cpong after connecting to the backend server failed (errno=110)
> [Mon Feb 23 03:05:39 2015] [25944:47348544151936] [error] ajp_send_request::jk_ajp_common.c (1663): (ead21) connecting to backend failed. Tomcat is probably not started or is listening on the wrong port (errno=110)
> [Mon Feb 23 12:59:18 2015] [25927:47348544151936] [error] ajp_get_reply::jk_ajp_common.c (2176): (dir16) Tomcat is down or refused connection. No response has been sent to the client (yet)
> [Mon Feb 23 14:12:12 2015] [26170:47348544151936] [warn] ajp_handle_cping_cpong::jk_ajp_common.c (965): awaited reply cpong, received 3 instead. Closing connection

I don't expect those to be consequences of a mod_jk problem. More likely 
it is the other way round. How many threads do you have configured for 
your AJP connector? Which connector (bio, nio, apr) do you use for Tomcat?

How many Apache web servers sit in front of the Tomcats? Is each of the 
Apaches configured with 512 processes?

>> Did you use "apachectl graceful" shortly before the problems start, or
>> change configuration via the mod_jk status worker?
>
> No.
>
>> What is special here, is the use of many processes plus many workers. So
>> the lock is used quite a lot. Still the "global maintain" functionality
>> which uses the jk_shm_lock() in the stack above should be called by each
>> process only every worker.maintain seconds, by default every 60 seconds.
>> And each process should quickly detect whether another process already did
>> the global maintain. During the global maintain, for any ajp13 worker
>> there are really just a few simple local code statements. For any lb (load
>> balancer) worker, there is a little more to do, especially checking whether
>> any members have failed long ago and should now be recovered. But still
>> those operations are all local, not going out on the network.
>>
>> What is your worker structure? Are the 150 workers part of a few load
>> balancers or are they all in the worker.list? Can you show significant
>> parts of your workers.properties? Are you using the non-default
>> "pessimistic" locking?
>
> I misspoke, we have ~150 applications so that would be the load balancer
> count.  Total worker count between all the LBs would be ~700.  We are using
> the default locking.  Here is one LB:
>
> # 109t
> worker.109t1.port=62037
> worker.109t1.host=java1
> worker.109t1.type=ajp13
> worker.109t1.lbfactor=1
> worker.109t1.connection_pool_timeout=539
> worker.109t1.socket_keepalive=true
> worker.109t1.ping_mode=CP
> worker.109t1.ping_timeout=5000
> worker.109t1.redirect=109t2
>
> worker.109t2.reference=worker.109t1
> worker.109t2.host=java2
> worker.109t2.redirect=109t3
>
> worker.109t3.reference=worker.109t1
> worker.109t3.host=java3
> worker.109t3.redirect=109t4
>
> worker.109t4.reference=worker.109t1
> worker.109t4.host=java4
> worker.109t4.redirect=109t1
>
> worker.109tlb.type=lb
> worker.109tlb.balance_workers=109t1,109t2,109t3,109t4
> worker.109tlb.sticky_session=true

And only the lb is added to the worker.list? Or also the individual AJP 
workers?

>> If your Apache uses a threaded APR library, you could try using
>> JkWatchdogInterval. That moved the maintenance from the normal request
>> processing code to a separate thread per Apache process.

I suspect that once your app (one or some of your apps) gets slow, 
requests pile up in Apache and Apache can no longer connect to Tomcat, 
because it doesn't have free threads. Some Apache children still can 
connect, because the connect is handled by the OS underneath Tomcat, but 
those connections do not get the awaited cping reply.

In addition to the Thread Dumps, you should also activate GC-Logs for 
your Tomcat JVMs.

>> - using the worker or prefork MPM in combination with JkWatchdogInterval would
>> reduce the amount of shm locking due to global maintenance calls, because it
>> is done per process, not per thread. You might have other reasons for using
>> prefork though. Have you?
>
> Use of prefork is historical.  We have been looking at going to event, but
> haven't made the switch yet.

If you are using many Apaches each with 512 children, you will most 
likely need to use an NIO connector on the Tomcat processes and would 
also benefit from worker or event MPM for Apache. Each process e.g. with 
50 or 100 threads and a relatively short connection_pool_timeout like 
e.g. 60 seconds.
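
In workers.properties terms that would look something like this sketch 
(names and values are examples; with a threaded MPM connection_pool_size 
should roughly match ThreadsPerChild, and connection_pool_timeout should 
stay below the connectionTimeout on the Tomcat connector):

   worker.109t1.connection_pool_size=50
   worker.109t1.connection_pool_minsize=10
   worker.109t1.connection_pool_timeout=60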

What was the reason, that you started suspecting the fcntl() underneath 
jk_shm_lock? Are we sure, that the stack you provided is really the 
typical one? And was the strace dumping all system calls or only the 
fcntl calls?

Regards,

Rainer



RE: mod_jk causing stuck Apache processes

Posted by Jesse Defer <Je...@asu.edu>.
> -----Original Message-----
> Am 23.02.2015 um 19:03 schrieb Jesse Defer:
> > I have a farm of Apache httpd servers proxying to Tomcat with mod_jk and
> I am having issues with Apache processes getting stuck (as seen by the W
> state in server-status).  I am sending to this list because the stack
> traces show httpd gets stuck in mod_jk.
> >
> > httpd is configured for prefork and with 512 servers on start and
> maximum.  When the problem occurs we end up with nearly all 512 processes
> in the W state until we restart it.  The problem occurs more often when
> load is high but is not restricted to high load.  The problem started
> occurring more often when we increased the servers from 384 to 512.  The
> hosts have enough memory and do not swap.  The issue occurs intermittently
> and is not tied to a particular host or Tomcat instance (there are ~150
> Tomcat instances).  The JkShmFile is on tmpfs.
> 
> Why on tmpfs?

Not sure, might be because of poor local IO performance (hosts are VMs) or 
historic use of network filesystems.

> > Environment: RHEL5.11, Apache 2.4.10 (prefork), JK 1.2.40, APR 1.5.1,
> > APR-UTIL 1.5.4
> >
> > The stuck httpd processes all show the same stack and strace:
> >
> > pstack:
> > #0  0x00002b3439026bff in fcntl () from /lib64/libpthread.so.0
> > #1  0x00002b3440911656 in jk_shm_lock () from
> > /usr/local/apache2/modules/mod_jk.so
> > #2  0x00002b3440917805 in ajp_maintain () from
> > /usr/local/apache2/modules/mod_jk.so
> > #3  0x00002b3440906cac in maintain_workers () from
> > /usr/local/apache2/modules/mod_jk.so
> > #4  0x00002b3440901140 in wc_maintain () from
> > /usr/local/apache2/modules/mod_jk.so
> > #5  0x00002b34408f40b6 in jk_handler () from
> > /usr/local/apache2/modules/mod_jk.so
> > #6  0x0000000000448eca in ap_run_handler ()
> > #7  0x000000000044cc92 in ap_invoke_handler ()
> > #8  0x000000000045e24f in ap_process_async_request ()
> > #9  0x000000000045e38f in ap_process_request ()
> > #10 0x000000000045ab65 in ap_process_http_connection ()
> > #11 0x00000000004530ba in ap_run_process_connection ()
> > #12 0x000000000046423a in child_main ()
> > #13 0x0000000000464544 in make_child ()
> > #14 0x00000000004649ae in prefork_run ()
> > #15 0x0000000000430634 in ap_run_mpm ()
> > #16 0x000000000042ad97 in main ()
> 
> So mod_jk tries to get a lock on the shared memory before reading and
> updating some shared memory data. That is one of the many things mod_jk
> does and is normal. It would not be normal if most processes seemed to be
> sitting in this stack almost all of the time.
> 
> > strace:
> > fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0
> > fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
> > time(NULL)                              = 1424711498
> > fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0
> > fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
> > fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0
> > fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
> > fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0
> > fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
> >
> > Any help tracking this issue down would be appreciated.
> 
> This doesn't look like mod_jk is hanging in the jk_shm_lock() but instead
> it looks like normal processing, locking and then unlocking.
> Both locking and unlocking succeed (return code 0).
> 
> You didn't provide time stamps to get an idea, whether it is normal
> behavior or not. What is your request throughput (requests per second
> forwarded by mod_jk as long as it is running well)?

Next time it happens I will capture timestamps, but it scrolls pretty 
quickly and doesn't seem to get hung up on any calls.  We have very seasonal
traffic, but normally around 30/sec.  The high period peaks at around 200/sec.

> I suspect something else is wrong. Have you checked whether mod_jk is in
> fact waiting for responses to requests it has sent to the backend and
> which do not return quickly, e.g. by looking at a Java thread dump
> (not: heap dump) of the backend during the times you have problems?

I haven't, next time it happens I'll look.

> What are the error or warn messages in your mod_jk log file?

Here is one instance of each type of warn/error in the logs of an affected host.
All of these are fairly common but we have tomcats going up and down all the 
time and I think most are related to that.  Log level is info.

[Mon Feb 23 03:05:37 2015] [25833:47348544151936] [error] ajp_service::jk_ajp_common.c (2693): (my27) connecting to tomcat failed.
[Mon Feb 23 03:05:39 2015] [25944:47348544151936] [error] ajp_connect_to_endpoint::jk_ajp_common.c (1050): (ead21) cping/cpong after connecting to the backend server failed (errno=110)
[Mon Feb 23 03:05:39 2015] [25944:47348544151936] [error] ajp_send_request::jk_ajp_common.c (1663): (ead21) connecting to backend failed. Tomcat is probably not started or is listening on the wrong port (errno=110)
[Mon Feb 23 12:59:18 2015] [25927:47348544151936] [error] ajp_get_reply::jk_ajp_common.c (2176): (dir16) Tomcat is down or refused connection. No response has been sent to the client (yet)
[Mon Feb 23 14:12:12 2015] [26170:47348544151936] [warn] ajp_handle_cping_cpong::jk_ajp_common.c (965): awaited reply cpong, received 3 instead. Closing connection

> Did you use "apachectl graceful" shortly before the problems start, or
> change configuration via the mod_jk status worker?

No.
 
> What is special here is the use of many processes plus many workers, so
> the lock is used quite a lot. Still, the "global maintain" functionality
> which uses the jk_shm_lock() in the stack above should be called by each
> process only every worker.maintain seconds, by default every 60 seconds.
> And each process should quickly detect whether another process already did
> the global maintain. During the global maintain, for any ajp13 worker
> there are really just a few simple local code statements. For any lb (load
> balancer) worker, there is a little more to do, especially checking whether
> any members failed long ago and should now be recovered. But still
> those operations are all local, not going out on the network.
> 
> What is your worker structure? Are the 150 workers part of a few load
> balancers, or are they all in the worker.list? Can you show the
> significant parts of your workers.properties? Are you using the
> non-default "pessimistic" locking?

I misspoke, we have ~150 applications so that would be the load balancer 
count.  Total worker count between all the LBs would be ~700.  We are using 
the default locking.  Here is one LB:

# 109t
worker.109t1.port=62037
worker.109t1.host=java1
worker.109t1.type=ajp13
worker.109t1.lbfactor=1
worker.109t1.connection_pool_timeout=539
worker.109t1.socket_keepalive=true
worker.109t1.ping_mode=CP
worker.109t1.ping_timeout=5000
worker.109t1.redirect=109t2

worker.109t2.reference=worker.109t1
worker.109t2.host=java2
worker.109t2.redirect=109t3

worker.109t3.reference=worker.109t1
worker.109t3.host=java3
worker.109t3.redirect=109t4

worker.109t4.reference=worker.109t1
worker.109t4.host=java4
worker.109t4.redirect=109t1

worker.109tlb.type=lb
worker.109tlb.balance_workers=109t1,109t2,109t3,109t4
worker.109tlb.sticky_session=true

> If your Apache uses a threaded APR library, you could try using
> JkWatchdogInterval. That moves the maintenance from the normal request
> processing code to a separate thread per Apache process.
> 
> Regards,
> 
> Rainer
> 
> - using the worker or prefork MPM in combination with JkWatchdogInterval would 
> reduce the amount of shm locking due to global maintenance calls, because it 
> is done per process, not per thread. You might have other reasons for using 
> prefork though. Have you?

Use of prefork is historical.  We have been looking at going to event, but 
haven't made the switch yet.

Respectfully,

Jesse



Re: mod_jk causing stuck Apache processes

Posted by Rainer Jung <ra...@kippdata.de>.
Am 23.02.2015 um 19:03 schrieb Jesse Defer:
> I have a farm of Apache httpd servers proxying to Tomcat with mod_jk and I am having issues with Apache processes getting stuck (as seen by the W state in server-status).  I am sending to this list because the stack traces show httpd gets stuck in mod_jk.
>
> httpd is configured for prefork and with 512 servers on start and maximum.  When the problem occurs we end up with nearly all 512 processes in the W state until we restart it.  The problem occurs more often when load is high but is not restricted to high load.  The problem started occurring more often when we increased the servers from 384 to 512.  The hosts have enough memory and do not swap.  The issue occurs intermittently and is not tied to a particular host or Tomcat instance (there are ~150 Tomcat instances).  The JkShmFile is on tmpfs.

Why on tmpfs?

> Environment: RHEL5.11, Apache 2.4.10 (prefork), JK 1.2.40, APR 1.5.1, APR-UTIL 1.5.4
>
> The stuck httpd processes all show the same stack and strace:
>
> pstack:
> #0  0x00002b3439026bff in fcntl () from /lib64/libpthread.so.0
> #1  0x00002b3440911656 in jk_shm_lock () from /usr/local/apache2/modules/mod_jk.so
> #2  0x00002b3440917805 in ajp_maintain () from /usr/local/apache2/modules/mod_jk.so
> #3  0x00002b3440906cac in maintain_workers () from /usr/local/apache2/modules/mod_jk.so
> #4  0x00002b3440901140 in wc_maintain () from /usr/local/apache2/modules/mod_jk.so
> #5  0x00002b34408f40b6 in jk_handler () from /usr/local/apache2/modules/mod_jk.so
> #6  0x0000000000448eca in ap_run_handler ()
> #7  0x000000000044cc92 in ap_invoke_handler ()
> #8  0x000000000045e24f in ap_process_async_request ()
> #9  0x000000000045e38f in ap_process_request ()
> #10 0x000000000045ab65 in ap_process_http_connection ()
> #11 0x00000000004530ba in ap_run_process_connection ()
> #12 0x000000000046423a in child_main ()
> #13 0x0000000000464544 in make_child ()
> #14 0x00000000004649ae in prefork_run ()
> #15 0x0000000000430634 in ap_run_mpm ()
> #16 0x000000000042ad97 in main ()

So mod_jk tries to get a lock on the shared memory before reading and 
updating some shared memory data. That is one of the many things mod_jk 
does and is normal. It would not be normal if most processes seemed to be 
sitting in this stack almost all of the time.

> strace:
> fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0
> fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
> time(NULL)                              = 1424711498
> fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0
> fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
> fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0
> fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
> fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0
> fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
>
> Any help tracking this issue down would be appreciated.

This doesn't look like mod_jk is hanging in the jk_shm_lock() but 
instead it looks like normal processing, locking and then unlocking. 
Both locking and unlocking succeed (return code 0).

You didn't provide time stamps to get an idea, whether it is normal 
behavior or not. What is your request throughput (requests per second 
forwarded by mod_jk as long as it is running well)?

I suspect something else is wrong. Have you checked whether mod_jk is in 
fact waiting for responses to requests it has sent to the backend and 
which do not return quickly, e.g. by looking at a Java thread dump 
(not: heap dump) of the backend during the times you have problems?

What are the error or warn messages in your mod_jk log file?

Did you use "apachectl graceful" shortly before the problems start, or 
change configuration via the mod_jk status worker?

What is special here is the use of many processes plus many workers, so 
the lock is used quite a lot. Still, the "global maintain" functionality 
which uses the jk_shm_lock() in the stack above should be called by each 
process only every worker.maintain seconds, by default every 60 seconds. 
And each process should quickly detect whether another process already 
did the global maintain. During the global maintain, for any ajp13 
worker there are really just a few simple local code statements. For any 
lb (load balancer) worker, there is a little more to do, especially 
checking whether any members failed long ago and should now be 
recovered. But still those operations are all local, not going out on 
the network.

What is your worker structure? Are the 150 workers part of a few load 
balancers, or are they all in the worker.list? Can you show the 
significant parts of your workers.properties? Are you using the 
non-default "pessimistic" locking?

If your Apache uses a threaded APR library, you could try using 
JkWatchdogInterval. That moves the maintenance from the normal request 
processing code to a separate thread per Apache process.
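For illustration, a minimal sketch of the two directives involved (names as documented for mod_jk; the 60-second values are the defaults mentioned above, shown explicitly):

```apache
# httpd.conf: with a threaded APR library, run shm maintenance in a
# separate watchdog thread per httpd process instead of piggybacking
# on request processing (requires JkWatchdogInterval support)
JkWatchdogInterval 60
```

```
# workers.properties: interval between global maintenance runs, seconds
worker.maintain=60
```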

Regards,

Rainer
