You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@httpd.apache.org by Tom Gilbert <to...@linuxbrit.co.uk> on 2006/06/13 14:44:58 UTC
[PATCH] apparent bug in 1.3's free_proc_chain

We encountered an issue where delays were incurred when CGIs were being
used in conjunction with keepalives for certain site configurations. We
eventually tracked the issue down to the strace snippet below:

(This is after the cgi (process 23062) exit()ed and apache has logged).

12:34:37.004384 times({tms_utime=370, tms_stime=157, tms_cutime=287,
tms_cstime=83}) = 140165142
12:34:37.004433 wait4(23062, NULL, WNOHANG, NULL) = -1 ECHILD (No child
processes)
12:34:37.636010 kill(23062, SIGTERM) = -1 ESRCH (No such process)
12:34:37.636070 select(0, NULL, NULL, NULL, {0, 46875}) = 0 (Timeout)
12:34:37.684762 wait4(23062, NULL, WNOHANG|WUNTRACED, NULL) = -1 ECHILD
(No child processes)
12:34:37.684819 select(0, NULL, NULL, NULL, {0, 46875}) = 0 (Timeout)
12:34:37.734850 wait4(23062, NULL, WNOHANG|WUNTRACED, NULL) = -1 ECHILD
(No child processes)
12:34:37.734916 select(0, NULL, NULL, NULL, {0, 93750}) = 0 (Timeout)
12:34:37.834993 wait4(23062, NULL, WNOHANG|WUNTRACED, NULL) = -1 ECHILD
(No child processes)
12:34:37.835056 select(0, NULL, NULL, NULL, {0, 187500}) = 0 (Timeout)
12:34:38.025223 wait4(23062, NULL, WNOHANG|WUNTRACED, NULL) = -1 ECHILD
(No child processes)
12:34:38.025291 select(0, NULL, NULL, NULL, {0, 375000}) = 0 (Timeout)
12:34:38.405707 wait4(23062, NULL, WNOHANG|WUNTRACED, NULL) = -1 ECHILD
(No child processes)
12:34:38.405777 select(0, NULL, NULL, NULL, {0, 750000}) = 0 (Timeout)
12:34:39.156833 wait4(23062, NULL, WNOHANG|WUNTRACED, NULL) = -1 ECHILD
(No child processes)
12:34:39.157075 select(0, NULL, NULL, NULL, {1, 500000}) = 0 (Timeout)
12:34:40.654556 wait4(23062, NULL, WNOHANG|WUNTRACED, NULL) = -1 ECHILD
(No child processes)
12:34:40.654641 kill(23062, SIGKILL) = -1 ESRCH (No such process)

At the point the next request is processed.

With keepalives this holds up the next request by 2-3 secs while all the
timeouts happen, and as you can see it's pointless because the child is
gone (ESRCH, ECHILD are the giveaways here).

I'm not sure this is the ideal patch for this, but I thought I'd have a
shot:

--- /tmp/apache_1.3.36.orig/src/main/alloc.c    2006-04-21 19:40:11.000000000 +0100
+++ /tmp/apache_1.3.36/src/main/alloc.c 2006-06-13 13:32:25.000000000 +0100
@@ -2806,9 +2806,9 @@
 #elif defined(NETWARE)
 #else
 #ifndef NEED_WAITPID
-    /* Pick up all defunct processes */
+    /* Pick up all defunct processes and avoid killing dead processes */
     for (p = procs; p; p = p->next) {
-       if (waitpid(p->pid, (int *) 0, WNOHANG) > 0) {
+       if (waitpid(p->pid, (int *) 0, WNOHANG) != 0) {
            p->kill_how = kill_never;
        }
     }
@@ -2845,7 +2845,7 @@
             need_timeout = 0;
             for (p = procs; p; p = p->next) {
                 if (p->kill_how == kill_after_timeout) {
-                    if (waitpid(p->pid, (int *) 0, WNOHANG | WUNTRACED) > 0)
+                    if (waitpid(p->pid, (int *) 0, WNOHANG | WUNTRACED) != 0)
                         p->kill_how = kill_never;
                     else
                         need_timeout = 1;


Perhaps this should be more sophisticated but it seems to work here.

Tom.
-- 
   .^.    Tom Gilbert, London, England
   /V\    http://linuxbrit.co.uk
 /(   )\  google talk: tom.gilbert
  ^^-^^