You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@httpd.apache.org by Joe Orton <jo...@redhat.com> on 2004/08/13 15:51:23 UTC

[PATCH] fix child reclaim timing

The 2.0 ap_reclaim_child_processes logic seems to be broken - it never
resets the waittime variable as it did in 1.3; so the parent will wait
for up to 23 minutes (sic) in total for a stuck child process.  (SIGSTOP
a child and strace the parent to see for yourself)

This updates the logic to be a little more sane:

- at t + 16, 82, 344 ms, just waitpid()
- at t + 425, 688, 1736 ms, waitpid() else SIGTERM the child
- at t + 1.74 secs, waitpid() else SIGKILL the child
- at t + 1.75, 1.82 secs, just waitpid()
- at t + 2.08 secs, waitpid() else log "this child won't die" 

Any comments?

Index: mpm_common.c
===================================================================
RCS file: /home/cvs/httpd-2.0/server/mpm_common.c,v
retrieving revision 1.120
diff -u -r1.120 mpm_common.c
--- mpm_common.c	15 Mar 2004 23:08:41 -0000	1.120
+++ mpm_common.c	13 Aug 2004 13:42:47 -0000
@@ -70,7 +70,7 @@
 
     ap_mpm_query(AP_MPMQ_MAX_DAEMON_USED, &max_daemons);
 
-    for (tries = terminate ? 4 : 1; tries <= 9; ++tries) {
+    for (tries = terminate ? 4 : 1; tries <= 10; ++tries) {
         /* don't want to hold up progress any more than
          * necessary, but we need to allow children a few moments to exit.
          * Set delay with an exponential backoff.
@@ -98,13 +98,15 @@
             switch (tries) {
             case 1:     /*  16ms */
             case 2:     /*  82ms */
+                break;
+                
             case 3:     /* 344ms */
-            case 4:     /*  16ms */
+                waittime = 16 * 1024;
                 break;
-
-            case 5:     /*  82ms */
-            case 6:     /* 344ms */
-            case 7:     /* 1.4sec */
+                
+            case 4:     /* 360ms */
+            case 5:     /* 425ms */
+            case 6:     /* 688ms */
                 /* ok, now it's being annoying */
                 ap_log_error(APLOG_MARK, APLOG_WARNING,
                              0, ap_server_conf,
@@ -114,7 +116,7 @@
                 kill(pid, SIGTERM);
                 break;
 
-            case 8:     /*  6 sec */
+            case 7:     /*  1.74 sec */
                 /* die child scum */
                 ap_log_error(APLOG_MARK, APLOG_ERR,
                              0, ap_server_conf,
@@ -132,9 +134,14 @@
                  */
                 kill_thread(pid);
 #endif
+                waittime = 16 * 1024;
+                break;
+
+            case 8:     /* 1.75 sec */
+            case 9:     /* 1.82 sec */
                 break;
 
-            case 9:     /* 14 sec */
+            case 10:    /* 2.08 secs */
                 /* gave it our best shot, but alas...  If this really
                  * is a child we are trying to kill and it really hasn't
                  * exited, we will likely fail to bind to the port

Re: [PATCH] fix child reclaim timing

Posted by Jeff Trawick <tr...@gmail.com>.
On Wed, 15 Sep 2004 19:49:22 +0100, Joe Orton <jo...@redhat.com> wrote:
> On Wed, Sep 15, 2004 at 01:48:11PM -0400, Jeff Trawick wrote:
> > Here's a patch that does something like I mentioned above, though it
> > bails out a bit sooner (9 or so seconds).  The timing of the
> > interesting actions in this patch can be tweaked in a much simpler
> > manner than the old 1.3 code allows.
> 
> Looks just as good to me, +1

re "just as good": it helps that stuff from jorton, jerenkrantz, and
nd don't go to my spam folder, then I don't go blindly patching... but
I committed mine anyway, with timings tweaked to be a little closer to
yours (a bit more patient in the intervals leading to SIGTERM) and
with the change below

> 
> > -         */
> > +    while (1) {
> >          apr_sleep(waittime);
> ...
> > -        if (!not_dead_yet) {
> > -            /* nothing left to wait for */
> > +        if (!not_dead_yet ||
> > +            action_table[cur_action].action == GIVEUP) {
> > +            /* nothing left to wait for, or we gave up */
> >              break;
> >          }
> >      }
> 
> ...is crying out to be a do/while :)

of course (thanks!)

Re: [PATCH] fix child reclaim timing

Posted by Joe Orton <jo...@redhat.com>.
On Wed, Sep 15, 2004 at 01:48:11PM -0400, Jeff Trawick wrote:
> Here's a patch that does something like I mentioned above, though it
> bails out a bit sooner (9 or so seconds).  The timing of the
> interesting actions in this patch can be tweaked in a much simpler
> manner than the old 1.3 code allows.

Looks just as good to me, +1

> -         */
> +    while (1) {
>          apr_sleep(waittime);
...
> -        if (!not_dead_yet) {
> -            /* nothing left to wait for */
> +        if (!not_dead_yet ||
> +            action_table[cur_action].action == GIVEUP) {
> +            /* nothing left to wait for, or we gave up */
>              break;
>          }
>      }

...is crying out to be a do/while :)


Re: [PATCH] fix child reclaim timing

Posted by Jeff Trawick <tr...@gmail.com>.
On Fri, 13 Aug 2004 10:27:11 -0400, Jeff Trawick <tr...@gmail.com> wrote:
> Here is my take on what is wrong with current code:
> 
> 1) It starts complaining a bit too soon.  Some third-party modules
> have rather complicated child exit strategies.  Whether or not that is
> good or bad (bad ;) ), it results in disturbing messages that wouldn't
> have appeared if we were a little more patient (2-3 seconds).  Also, I
> suspect that the use of threaded MPM affects how quickly the children
> are exiting now on Unix.
> 
> 2) It should never stop checking for exited processes less often than
> 1-2 seconds, even if it doesn't complain to error log that often.
> Like you say, current code can wait a VERY long time for child
> processes to exit.  In practice, I see that it can wait a VERY long
> time even after the last child has exited.
> 
> I'll agree that it should never wait so long, though I think around 15
> or so seconds total is reasonable.  Exiting before children are gone
> doesn't let Apache start up any more quickly; it just prevents
> potentially-useful information about timing from getting logged to the
> error log.
> 
> --/--
> 
> I wouldn't complain to error log at all until it has been 2 seconds,
> and then I'd still wait around for 10-15 more.  But it has to check
> every second so it finds out soon after all children have exited and
> doesn't sleep needlessly.
> 

Here's a patch that does something like I mentioned above, though it
bails out a bit sooner (9 or so seconds).  The timing of the
interesting actions in this patch can be tweaked in a much simpler
manner than the old 1.3 code allows.

Re: [PATCH] fix child reclaim timing

Posted by Jeff Trawick <tr...@gmail.com>.
On Fri, 13 Aug 2004 14:51:23 +0100, Joe Orton <jo...@redhat.com> wrote:
> The 2.0 ap_reclaim_child_processes logic seems to be broken - it never
> resets the waittime variable as it did in 1.3; so the parent will wait
> for up to 23 minutes (sic) in total for a stuck child process.  (SIGSTOP
> a child and strace the parent to see for yourself)
> 
> This updates the logic to be a little more sane:
> 
> - at t + 16, 82, 344 ms, just waitpid()
> - at t + 425, 688, 1736 ms, waitpid() else SIGTERM the child
> - at t + 1.74 secs, waitpid() else SIGKILL the child
> - at t + 1.75, 1.82 secs, just waitpid()
> - at t + 2.08 secs, waitpid() else log "this child won't die"
> 
> Any comments?

Here is my take on what is wrong with current code:

1) It starts complaining a bit too soon.  Some third-party modules
have rather complicated child exit strategies.  Whether or not that is
good or bad (bad ;) ), it results in disturbing messages that wouldn't
have appeared if we were a little more patient (2-3 seconds).  Also, I
suspect that the use of threaded MPM affects how quickly the children
are exiting now on Unix.

2) It should never stop checking for exited processes less often than
1-2 seconds, even if it doesn't complain to error log that often. 
Like you say, current code can wait a VERY long time for child
processes to exit.  In practice, I see that it can wait a VERY long
time even after the last child has exited.

I'll agree that it should never wait so long, though I think around 15
or so seconds total is reasonable.  Exiting before children are gone
doesn't let Apache start up any more quickly; it just prevents
potentially-useful information about timing from getting logged to the
error log.

--/--

I wouldn't complain to error log at all until it has been 2 seconds,
and then I'd still wait around for 10-15 more.  But it has to check
every second so it finds out soon after all children have exited and
doesn't sleep needlessly.