You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@httpd.apache.org by Andrew Ho <an...@tellme.com> on 2001/02/14 22:24:00 UTC

Apache 1.3.x race condition causes gratuitous 3-second CGI delay

Hello,

We've discovered an apparent race condition in Apache 1.3.x CGI handling
which results in occasional unnecessary 3-second delays resulting from a
pause between the closing of the CGI child process' pipe and that process'
subsequent exit.

Specifically, the code in mod_cgi.c reads from its child process until the
child process breaks the pipe. The cleanup code in alloc.c then calls
waitpid with WNOHANG to check to see if its child process has died; if its
pid is not waiting, Apache assumes that the process has hung. It sends a
SIGTERM, waits 3 seconds, then sends a SIGKILL. The relevant code is in
free_proc_chain() in alloc.c.

The former assumption (if the child pid is not waiting to be reaped, the
child process must have hung and should be killed) appears to be erroneous
on at least some configurations. Specifically, imagine that the CGI child
process exits 10ms after the cleanup code in alloc.c is run. In this case,
the Apache process sleeps 3 seconds, when really it didn't need to.

This problem is only client-visible with HTTP/1.1 keep-alive, Apache
running as a single process, or bad luck where the client talks to the
same child more than once. The user-visible symptom is then a 3-second
delay following a CGI request.

On Solaris x86 on a dual-processor box, we see this behavior perhaps 1/5
of the time for any particular child. On most other systems we tested,
you have to explicitly try to trigger the bug. This CGI will do it:

    #!/usr/local/bin/perl
    print "Content-type: text/plain\n\n";
    print "Hello, world.\n";
    close STDOUT;
    sleep 1;  # or anything else that does a system call, really
    exit 0;

Instead of delaying 1 second, the child displays a 3 second delay after
running this CGI. Note that on our dual-processor Solaris x86 boxes, even
just the simple Hello, world (no premature closing of STDOUT or sleeping)
exhibits the 3 second delay regularly. I'm not sure why this is; perhaps
Solaris is doing some cleanup that Linux is not, or there is some child
reaping issue with the multiple processors.

Here are the configurations we tested. Patched Apaches (with mod_perl
or mod_ssl capabilities) had the same behaviors as straight out-of-the-box
configurations; having DSOs enabled was likewise irrelevant.

    * Solaris x86, dual processor Intel boxes, Apache 1.3.9, 1.3.1[247]
        * On Apache 1.3.14, mod_perl and mod_ssl and non-DSO variants
        * All configurations display sporadic 3-second CGI delays
          even in a simple Hello, world CGI.
    * Linux, single processor Intel boxes, Apache 1.3.12, 1.3.14
        * Without explicitly closing STDOUT, the bug doesn't appear,
          but if you close STDOUT and do really anything at all
          (including just a timing loop), the bug appears
    * Solaris on a single processor Sparc box, Apache 1.3.12
    * FreeBSD, dual processor Intel box, Apache 1.3.12
    * OpenBSD, single process Intel box, Apache 1.3.12

I've appended my test script below, it is a simple Perl script that opens
a socket connection to a webserver and does repeated HTTP/1.1 Keep-Alive
requests, timing each trial.

This bug may be the same as PR 6961 (repeated requests for a simple cgi
invoke delay of Apache) and is related loosely to PR 6226 (closing STDOUT
doesn't end session to allow background processing of code). What's the
best thing for me to do--file a new PR with this information?

Thanks in advance--

Humbly,

Andrew

----------------------------------------------------------------------
Andrew Ho               http://www.tellme.com/       andrew@tellme.com
Engineer                   info@tellme.com          Voice 650-930-9062
Tellme Networks, Inc.       1-800-555-TELL            Fax 650-930-9101
----------------------------------------------------------------------




#!/usr/local/bin/perl -w
# ========================================================================
# kahit - hit an HTTP/1.1 webserver over and over again in Keep-Alive mode
# Andrew Ho (andrew@tellme.com)
#
# This script opens a network connection to a webserver, and issues an
# HTTP request that specifies Keep-Alive. It issues that request over and
# over again on the same connection, exiting when the server side closes
# the connection explicitly. It understands both Content-Length and
# chunked Transfer-Encoding, and respects the Connection header.
#
# Last modified February 14, 2001
# ========================================================================
use strict;

use Socket;
use Time::HiRes qw(gettimeofday tv_interval);

use vars qw($HOST $PORT $PATH);
$HOST = 'sparky.tellme.com';
$PORT = 80;
$PATH = '/~andrew/break.cgi';

use vars qw($CRLF);
$CRLF = "\015\012";

use vars qw($VERBOSE);
$VERBOSE = 0;


# ------------------------------------------------------------------------
# Set up network connection

my $proto = getprotobyname('tcp');

socket(CLIENT, PF_INET, SOCK_STREAM, $proto) || die "socket error: $!";
my $addr = inet_aton($HOST)                  || die "could not resolve $HOST";
my $paddr = sockaddr_in($PORT, $addr);

connect(CLIENT, $paddr)                      || die "connect error: $!";


# ------------------------------------------------------------------------
# Main loop

$SIG{INT} = $SIG{TERM} = sub { exit };    # trigger END block on TERM
printf "----started, %s----\n", scalar localtime;

my $eof;
while (!$eof) {
    printf "----begin, %s----\n", scalar localtime;
    my $begin = [ gettimeofday ];

    # Output HTTP request line and Connection header
    print CLIENT 'GET ', $PATH, ' HTTP/1.1', $CRLF;
    print        'GET ', $PATH, ' HTTP/1.1', "\n" if $VERBOSE;
    print CLIENT 'Host: ', $HOST, $CRLF;
    print        'Host: ', $HOST, "\n" if $VERBOSE;
    print CLIENT 'Connection: keep-alive', $CRLF;
    print        'Connection: keep-alive', "\n" if $VERBOSE;
    print CLIENT $CRLF;
    print "\n" if $VERBOSE;

    # Flush output to both server and terminal
    my $fh = select CLIENT;
    local $| = 1;
    select $fh;
    local $| = 1 if $VERBOSE;

    # Read response headers
    my($content_length, $chunked);
    while(<CLIENT>) {
        # Respect EOF and Connection: close header
        do { $eof = 1; last } if !defined $_ || /^connection\:.*close/i;
        $content_length = $1 if /^content\-length\:\s*(\d+)\s*$/i;
        $chunked = 1 if /^transfer\-encoding\:.*chunked/i;
        print if $VERBOSE;
        last if /^\s*$/;
    }

    # Read response entity body
    if(defined $content_length) {
        # Got a Content-Length, so we can just read once

        # Content-length can be 0, in which case don't read
        if($content_length) {
            my $buffer;
            if($content_length != read CLIENT, $buffer, $content_length) {
                warn "did not read $content_length bytes";
            }
            print $buffer if $VERBOSE;
        }

    } elsif($chunked) {
        # Use chunked Transfer-Encoding (see RFC 2616, 19.4.6)

        # Read first chunk size
        my $chunk_size = <CLIENT>;
        do { $eof = 1; last } unless defined $chunk_size;
        print $chunk_size if $VERBOSE;
              $chunk_size =~ s/\s+$//;
              $chunk_size = hex $chunk_size;

        while($chunk_size > 0) {
            my $buffer;
            if($chunk_size != read CLIENT, $buffer, $chunk_size) {
                warn "did not read $chunk_size bytes";
            }
            print $buffer if $VERBOSE;

            # Read next chunk size (identical to code above)
            $chunk_size = <CLIENT>;
            do { $eof = 1; last } unless defined $chunk_size;
            print $chunk_size if $VERBOSE;
                  $chunk_size =~ s/\s+$//;
                  $chunk_size = hex $chunk_size;
        }

        # There can be more headers, or a CRLF here
        while(<CLIENT>) {
            do { $eof = 1; last } unless defined $_;
            print if $VERBOSE;
            last if /^\s*$/;
        }

    } else {
        # Hrm. Don't know how to handle this case

        warn "no Content-Length, and not chunked Transfer-Encoding\n";
        $eof = 1;

    }

    my $elapsed = tv_interval($begin, [ gettimeofday ]);
    printf "----end, %0.3f ms elapsed----\n", $elapsed * 1000;
}


# ------------------------------------------------------------------------
# Clean up

END {
    $SIG{INT} = $SIG{TERM} = 'IGNORE';
    close CLIENT;
    printf "----terminated, %s----\n", scalar localtime;
    exit 0;
}


# ========================================================================
__END__

Re: Apache 1.3.x race condition causes gratuitous 3-second CGI delay

Posted by Andrew Ho <an...@tellme.com>.

Hello,

AH>We've discovered an apparent race condition in Apache 1.3.x CGI handling
AH>which results in occasional unnecessary 3-second delays resulting from a
AH>pause between the closing of the CGI child process' pipe and that process'
AH>subsequent exit.

JJ>Thanks for the very detailed write-up. We'll look deeper and see if
JJ>we can clean this up, and use existing knowledge to TERM-wait-KILL
JJ>only when needed.

Thanks for the response! FWIW, here is a patch to alloc.c which fixed the
problem for our installation; all it does is install a SIGCHLD handler
right before the sleep(3), and now that the signal is being paid attention
to, the signal pops us out of the sleep(). We then re-install any previous
SIGCHLD handler.

There's still a race between setting the handler and sleep()ing, but
that's a pretty tight race and the cost of it is a relatively non-tragic
3-second delay in subsequent Keep-Alive requests; not too bad.

What I'm afraid of is that our "fix" is entirely too naive. There's plenty
of interactions I'm not familiar with; for example, there may be multiple
SIGCHLDs flying around, and we don't check to see that the one that just
came is the right one.

Humbly,

Andrew

----------------------------------------------------------------------
Andrew Ho               http://www.tellme.com/       andrew@tellme.com
Engineer                   info@tellme.com          Voice 650-930-9062
Tellme Networks, Inc.       1-800-555-TELL            Fax 650-930-9101
----------------------------------------------------------------------

Re: Apache 1.3.x race condition causes gratuitous 3-second CGI delay

Posted by Jim Jagielski <ji...@jaguNET.com>.

>Hello,
>
>We've discovered an apparent race condition in Apache 1.3.x CGI handling
>which results in occasional unnecessary 3-second delays resulting from a
>pause between the closing of the CGI child process' pipe and that process'
>subsequent exit.
>
>Specifically, the code in mod_cgi.c reads from its child process until the
>child process breaks the pipe. The cleanup code in alloc.c then calls
>waitpid with WNOHANG to check to see if its child process has died; if its
>pid is not waiting, Apache assumes that the process has hung. It sends a
>SIGTERM, waits 3 seconds, then sends a SIGKILL. The relevant code is in
>free_proc_chain() in alloc.c.
>

Thanks for the very detailed write-up. We'll look deeper and see if
we can clean this up, and use existing knowledge to TERM-wait-KILL
only when needed.
-- 
===========================================================================
   Jim Jagielski   [|]   jim@jaguNET.com   [|]   http://www.jaguNET.com/
          "Casanova will have many weapons; To beat him you will
              have to have more than forks and flatulence."