Posted to dev@httpd.apache.org by Ed Korthof <ed...@apache.org> on 2000/01/04 22:45:41 UTC

[PATCH] reclaim_child_processes

hi --

this patch is intended to avoid a problem which i've witnessed in apache
installations with certain third party libraries: if there are many
children, and they take sufficiently long to shut down, then apache's
reclaim_child_processes sends the remaining children SIGKILL.  that's
alright with me -- shutdown shouldn't take so long -- but the current code
doesn't wait around for them to die; if they haven't all finished
terminating right away, it sleeps for approximately 16 seconds before
noticing that they're dead. thus, the total time required to shut down is
generally a bit more than twenty seconds ... the last 16 seconds of which
are quite unnecessary.
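
for anyone who wants to check the arithmetic, here is a small standalone
sketch (not code from http_main.c). it assumes the delay starts at
1024 * 16 usec, the value the patch assigns to waittime, and quadruples on
every pass; the factor of 4 is an assumption inferred from the timings
quoted in this mail. the names cumulative_wait_usec and print_schedule are
made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed constants: 16384 usec initial delay (1024 * 16, as in the patch)
 * and a factor of 4 per pass (inferred from the quoted timings). */
#define INITIAL_WAIT_USEC (1024L * 16)
#define BACKOFF_FACTOR 4L

/* Total time slept, in usec, after `passes` passes through the loop. */
long cumulative_wait_usec(int passes)
{
    long wait = INITIAL_WAIT_USEC;
    long total = 0;
    int i;

    assert(passes >= 0);
    for (i = 0; i < passes; ++i) {
        total += wait;
        wait *= BACKOFF_FACTOR;
    }
    return total;
}

/* Print the length of each sleep and the running total. */
void print_schedule(int passes)
{
    int i;

    for (i = 1; i <= passes; ++i) {
        long total = cumulative_wait_usec(i);
        long last = total - cumulative_wait_usec(i - 1);

        printf("pass %d: sleep %8.3f s, total %8.3f s\n",
               i, last / 1e6, total / 1e6);
    }
}
```

under those assumptions, print_schedule(6) shows a sixth sleep of about
16.8 seconds out of a total of about 22.4 seconds, which lines up with the
"16 seconds" and "a bit more than twenty seconds" figures above.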

anyway, what this patch does is change things so that after SIGKILL has
been sent, the parent checks a couple of times (approx: 16ms, 84ms, 350ms,
1.4sec) and then decides that the SIGKILL failed.  i could easily adjust
this so that the total time spent waiting for the SIGKILL to fail is what
it was -- 16 seconds, give or take -- but in writing this, i figured that
if they haven't died after 1.4 seconds, then the SIGKILL wasn't
sufficient; it's not instant, but it should never take that long.
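
the idea can be sketched outside of apache as follows. this is only an
illustration under stated assumptions, not the code in the patch:
kill_and_reap and sleep_usec are hypothetical helpers, the sleep uses
nanosleep rather than whatever http_main.c uses, and the 16384 usec start
with a factor of 4 is assumed from the timings quoted above.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Sleep for the given number of microseconds. */
static void sleep_usec(long usec)
{
    struct timespec ts;

    ts.tv_sec = usec / 1000000;
    ts.tv_nsec = (usec % 1000000) * 1000;
    nanosleep(&ts, NULL);
}

/* Hypothetical helper: send SIGKILL to `pid`, then poll for its exit up to
 * `max_checks` times, sleeping 16ms, 65ms, 262ms, ... before each check.
 * Returns the number of checks it took to reap the child, or -1 if the
 * signal could not be sent or the child was not reaped in time. */
int kill_and_reap(pid_t pid, int max_checks)
{
    long wait = 1024L * 16;   /* assumed initial delay, in usec */
    int check;

    assert(max_checks > 0);
    if (kill(pid, SIGKILL) == -1)
        return -1;

    for (check = 1; check <= max_checks; ++check) {
        sleep_usec(wait);
        wait *= 4;            /* assumed backoff factor */
        if (waitpid(pid, NULL, WNOHANG) == pid)
            return check;
    }
    return -1;                /* SIGKILL apparently wasn't sufficient */
}
```

with max_checks set to 4, the caller gives up roughly 1.4 seconds after
sending the signal instead of sitting through one long sleep.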

i sent out a couple of patches about this earlier, but i've laid down the
crack pipe i was smoking when i sent the first one; and this makes no
changes to the comments as per Roy's comments.

i realize this doesn't happen except in a fairly pathological situation,
but we really should change the code in reclaim_child_processes to give
the children at least a little bit of time to die from the SIGKILL.  in
the pathological situation in which it occurs, it's quite annoying.

thanks --

ed

*****
Index: http_main.c
===================================================================
RCS file: /home/cvs/apache-1.3/src/main/http_main.c,v
retrieving revision 1.486
diff -b -c -u -r1.486 http_main.c
--- http_main.c	2000/01/01 17:07:34	1.486
+++ http_main.c	2000/01/04 21:34:27
@@ -2378,7 +2378,7 @@
 
     ap_sync_scoreboard_image();
 
-    for (tries = terminate ? 4 : 1; tries <= 9; ++tries) {
+    for (tries = terminate ? 4 : 1; tries <= 12; ++tries) {
 	/* don't want to hold up progress any more than 
 	 * necessary, but we need to allow children a few moments to exit.
 	 * Set delay with an exponential backoff.
@@ -2433,8 +2433,13 @@
 		   "child process %d still did not exit, sending a SIGKILL",
 			    pid);
 		kill(pid, SIGKILL);
+		waittime = 1024 * 16; /* give them some time to die */
 		break;
-	    case 9:     /* 14 sec */
+	    case 9:     /*   6 sec */
+	    case 10:    /* 6.1 sec */
+	    case 11:    /* 6.4 sec */
+		break;
+	    case 12:    /* 7.4 sec */
 		/* gave it our best shot, but alas...  If this really 
 		 * is a child we are trying to kill and it really hasn't
 		 * exited, we will likely fail to bind to the port