You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@httpd.apache.org by Brian Candler <B....@pobox.com> on 2003/03/18 18:24:23 UTC

[PATCH] Race condition with CGI reaping under Solaris

Hello,

For a while I have been noticing performance problems with CGI scripts
running under Apache (1.3.27) and Solaris (2.8). These are "normal" forked
CGI scripts: not FastCGI, mod_perl etc.

The symptom is that very often there is a three-second delay between
requesting the page and getting the completed response. It's actually
reasonably easy to replicate. Take a simple CGI script like this:

#!/bin/sh
echo "Content-Type: text/html"
echo ""
echo "Hello from shell"

Hit the webserver with 50 requests straight after each other for this page,
and I find that almost every alternate request takes 3 seconds. However,
with this example, trussing the process which answers the requests is enough
to make the problem go away (hinting at a race condition).

Running the equivalent script in Ruby did allow me to get a truss on it and
still fail occasionally, which showed me:

12458:  24.7093 waitid(P_PID, 12472, 0xFFBEF7E8, WEXITED|WTRAPPED|WNOHANG) = 0
12458:  24.7096 kill(12472, SIGTERM)                            = 0
...
12458:  24.7104 alarm(3)                                        = 0
...
12458:  sigsuspend(0xFFBEF7F8)          (sleeping...)
...
12458:  27.7048     Received signal #14, SIGALRM, in sigsuspend() [caught]
12458:  27.7051 sigsuspend(0xFFBEF7F8)                          Err#4 EINTR

It then turns out that you can replicate the program 100% reliably by
introducing a small delay after closing stdin/stdout but before exiting:

#!/usr/local/bin/ruby
puts "Content-Type: text/html"
puts ""
puts "Hello from Ruby"
$stdin.close
$stdout.close
sleep 0.2

>From looking through the code, I believe the problem lies in free_proc_chain
in src/main/alloc.c. The logic is:

- call waitpid() for each process to be cleaned up, to see if it has
  already died
- if it hasn't, then send it a kill SIGTERM
- if kill returns 0 (meaning the process is still alive) then sleep 3 seconds
  then send it a SIGKILL

The race condition seems to as follows: when the CGI process has finished
its work, it closes stdin/stdout which completes the request. However it
takes a short period of time to tidy up after itself before it finally dies,
by which time Apache has already decided to send it a SIGTERM and is
committed to a 3 second wait.

This is the first time I have dug around the internals of Apache and I am
not certain as to what is _supposed_ to happen in this case; in particular I
don't know if the sleep(3) is meant to be interruptible by a SIGCHLD.
However if it is, it certainly isn't happening on my platform.

My proposed solution is to poll waitpid() a few times before deciding that
the process needs to be sent a TERM. I have put together a rough patch as
proof-of-concept (attached) which does this at intervals of 10ms, 20ms,
40ms, 80ms, giving a total potential wait of 150ms. Maybe it should have
another iteration to make 310ms. It's not particularly pretty but I wanted
to localise my changes as far as possible.

Given my lack of knowledge of the Apache codebase I can't say whether this
is The Right Way[TM] to deal with this problem, but it's certainly made a
huge improvement here. Comments gratefully received.

If this is the right approach, perhaps one of the main Apache developers
would like to produce a tidier patch?

Regards,

Brian Candler.

Re: [PATCH] Race condition with CGI reaping under Solaris

Posted by "William A. Rowe, Jr." <wr...@rowe-clan.net>.
At 11:55 AM 3/18/2003, Bill Stoddard wrote:
>This patch was committed to 1.3.28-dev about 4 months ago (gee, has it really been 4+ months since a 1.3. release?)

I think Jim pointed out that it would be good to get a 1.3.28 release
done, but I don't know if anyone has stepped up to RM (no, I'm not
stepping up ... I have a very full plate already ;-)

Considering alot of folks will be 'freshening' their openssl and mod_ssl
builds, this would be an ideal time to provide 1.3.28... yes - we probably
have nothing to do on the Apache 1.3 side of the coin for the recent SSL 
timing issues, but the best code we've got would be nice to share :-)

So just consider this a preemptive +1, and I'd be happy to get the
Win32 binaries together.

Bill



Re: [PATCH] Race condition with CGI reaping under Solaris

Posted by Brian Candler <B....@pobox.com>.
On Tue, Mar 18, 2003 at 12:55:52PM -0500, Bill Stoddard wrote:
> This patch was committed to 1.3.28-dev about 4 months ago (gee, has it 
> really been 4+ months since a 1.3. release?)
> 
> http://cvs.apache.org/viewcvs.cgi/apache-1.3/src/main/alloc.c.diff?r1=1.128&r2=1.129

Thank you. I never thought to check CVS - it seems to me too like 1.3.27 was
a recent release :-)

That patch is not quite the same as mine, as it still sends a SIGTERM
immediately. I guess any application with a clean-up phase will have to
catch SIGTERM if it wants to be sure of having a chance to finish its job;
although if it's just the tail end of a C runtime environment finishing off
then it probably doesn't matter.

One other comment: I note that the loop which monitors for process status
uses waitpid(...WNOHANG). But the first bit of code which checks the process
status is wrapped inside

#ifndef NEED_WAITPID
...
#endif

So I just wonder, if waitpid() is safe to use in the second part, why the
first part needs to be #ifndef'd out in this way?

Regards,

Brian.

Re: [PATCH] Race condition with CGI reaping under Solaris

Posted by Bill Stoddard <bi...@wstoddard.com>.
This patch was committed to 1.3.28-dev about 4 months ago (gee, has it 
really been 4+ months since a 1.3. release?)

http://cvs.apache.org/viewcvs.cgi/apache-1.3/src/main/alloc.c.diff?r1=1.128&r2=1.129

Bill