You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@httpd.apache.org by Joe Orton <jo...@redhat.com> on 2004/08/13 15:51:23 UTC
[PATCH] fix child reclaim timing
The 2.0 ap_reclaim_child_processes logic seems to be broken - it never
resets the waittime variable as it did in 1.3; so the parent will wait
for up to 23 minutes (sic) in total for a stuck child process. (SIGSTOP
a child and strace the parent to see for yourself)
This updates the logic to be a little more sane:
- at t + 16, 82, 344 ms, just waitpid()
- at t + 425, 688, 1736 ms, waitpid() else SIGTERM the child
- at t + 1.74 secs, waitpid() else SIGKILL the child
- at t + 1.75, 1.82 secs, just waitpid()
- at t + 2.08 secs, waitpid() else log "this child won't die"
Any comments?
Index: mpm_common.c
===================================================================
RCS file: /home/cvs/httpd-2.0/server/mpm_common.c,v
retrieving revision 1.120
diff -u -r1.120 mpm_common.c
--- mpm_common.c 15 Mar 2004 23:08:41 -0000 1.120
+++ mpm_common.c 13 Aug 2004 13:42:47 -0000
@@ -70,7 +70,7 @@
ap_mpm_query(AP_MPMQ_MAX_DAEMON_USED, &max_daemons);
- for (tries = terminate ? 4 : 1; tries <= 9; ++tries) {
+ for (tries = terminate ? 4 : 1; tries <= 10; ++tries) {
/* don't want to hold up progress any more than
* necessary, but we need to allow children a few moments to exit.
* Set delay with an exponential backoff.
@@ -98,13 +98,15 @@
switch (tries) {
case 1: /* 16ms */
case 2: /* 82ms */
+ break;
+
case 3: /* 344ms */
- case 4: /* 16ms */
+ waittime = 16 * 1024;
break;
-
- case 5: /* 82ms */
- case 6: /* 344ms */
- case 7: /* 1.4sec */
+
+ case 4: /* 360ms */
+ case 5: /* 425ms */
+ case 6: /* 688ms */
/* ok, now it's being annoying */
ap_log_error(APLOG_MARK, APLOG_WARNING,
0, ap_server_conf,
@@ -114,7 +116,7 @@
kill(pid, SIGTERM);
break;
- case 8: /* 6 sec */
+ case 7: /* 1.74 sec */
/* die child scum */
ap_log_error(APLOG_MARK, APLOG_ERR,
0, ap_server_conf,
@@ -132,9 +134,14 @@
*/
kill_thread(pid);
#endif
+ waittime = 16 * 1024;
+ break;
+
+ case 8: /* 1.75 sec */
+ case 9: /* 1.82 sec */
break;
- case 9: /* 14 sec */
+ case 10: /* 2.08 secs */
/* gave it our best shot, but alas... If this really
* is a child we are trying to kill and it really hasn't
* exited, we will likely fail to bind to the port
Re: [PATCH] fix child reclaim timing
Posted by Jeff Trawick <tr...@gmail.com>.
On Wed, 15 Sep 2004 19:49:22 +0100, Joe Orton <jo...@redhat.com> wrote:
> On Wed, Sep 15, 2004 at 01:48:11PM -0400, Jeff Trawick wrote:
> > Here's a patch that does something like I mentioned above, though it
> > bails out a bit sooner (9 or so seconds). The timing of the
> > interesting actions in this patch can be tweaked in a much simpler
> > manner than the old 1.3 code allows.
>
> Looks just as good to me, +1
re "just as good": it helps that stuff from jorton, jerenkrantz, and
nd don't go to my spam folder, then I don't go blindly patching... but
I committed mine anyway, with timings tweaked to be a little closer to
yours (a bit more patient in the intervals leading to SIGTERM) and
with the change below
>
> > - */
> > + while (1) {
> > apr_sleep(waittime);
> ...
> > - if (!not_dead_yet) {
> > - /* nothing left to wait for */
> > + if (!not_dead_yet ||
> > + action_table[cur_action].action == GIVEUP) {
> > + /* nothing left to wait for, or we gave up */
> > break;
> > }
> > }
>
> ...is crying out to be a do/while :)
of course (thanks!)
Re: [PATCH] fix child reclaim timing
Posted by Joe Orton <jo...@redhat.com>.
On Wed, Sep 15, 2004 at 01:48:11PM -0400, Jeff Trawick wrote:
> Here's a patch that does something like I mentioned above, though it
> bails out a bit sooner (9 or so seconds). The timing of the
> interesting actions in this patch can be tweaked in a much simpler
> manner than the old 1.3 code allows.
Looks just as good to me, +1
> - */
> + while (1) {
> apr_sleep(waittime);
...
> - if (!not_dead_yet) {
> - /* nothing left to wait for */
> + if (!not_dead_yet ||
> + action_table[cur_action].action == GIVEUP) {
> + /* nothing left to wait for, or we gave up */
> break;
> }
> }
...is crying out to be a do/while :)
Re: [PATCH] fix child reclaim timing
Posted by Jeff Trawick <tr...@gmail.com>.
On Fri, 13 Aug 2004 10:27:11 -0400, Jeff Trawick <tr...@gmail.com> wrote:
> Here is my take on what is wrong with current code:
>
> 1) It starts complaining a bit too soon. Some third-party modules
> have rather complicated child exit strategies. Whether or not that is
> good or bad (bad ;) ), it results in disturbing messages that wouldn't
> have appeared if we were a little more patient (2-3 seconds). Also, I
> suspect that the use of threaded MPM affects how quickly the children
> are exiting now on Unix.
>
> 2) It should never stop checking for exited processes less often than
> 1-2 seconds, even if it doesn't complain to error log that often.
> Like you say, current code can wait a VERY long time for child
> processes to exit. In practice, I see that it can wait a VERY long
> time even after the last child has exited.
>
> I'll agree that it should never wait so long, though I think around 15
> or so seconds total is reasonable. Exiting before children are gone
> doesn't let Apache start up any more quickly; it just prevents
> potentially-useful information about timing from getting logged to the
> error log.
>
> --/--
>
> I wouldn't complain to error log at all until it has been 2 seconds,
> and then I'd still wait around for 10-15 more. But it has to check
> every second so it finds out soon after all children have exited and
> doesn't sleep needlessly.
>
Here's a patch that does something like I mentioned above, though it
bails out a bit sooner (9 or so seconds). The timing of the
interesting actions in this patch can be tweaked in a much simpler
manner than the old 1.3 code allows.
Re: [PATCH] fix child reclaim timing
Posted by Jeff Trawick <tr...@gmail.com>.
On Fri, 13 Aug 2004 14:51:23 +0100, Joe Orton <jo...@redhat.com> wrote:
> The 2.0 ap_reclaim_child_processes logic seems to be broken - it never
> resets the waittime variable as it did in 1.3; so the parent will wait
> for up to 23 minutes (sic) in total for a stuck child process. (SIGSTOP
> a child and strace the parent to see for yourself)
>
> This updates the logic to be a little more sane:
>
> - at t + 16, 82, 344 ms, just waitpid()
> - at t + 425, 688, 1736 ms, waitpid() else SIGTERM the child
> - at t + 1.74 secs, waitpid() else SIGKILL the child
> - at t + 1.75, 1.82 secs, just waitpid()
> - at t + 2.08 secs, waitpid() else log "this child won't die"
>
> Any comments?
Here is my take on what is wrong with current code:
1) It starts complaining a bit too soon. Some third-party modules
have rather complicated child exit strategies. Whether or not that is
good or bad (bad ;) ), it results in disturbing messages that wouldn't
have appeared if we were a little more patient (2-3 seconds). Also, I
suspect that the use of threaded MPM affects how quickly the children
are exiting now on Unix.
2) It should never stop checking for exited processes less often than
1-2 seconds, even if it doesn't complain to error log that often.
Like you say, current code can wait a VERY long time for child
processes to exit. In practice, I see that it can wait a VERY long
time even after the last child has exited.
I'll agree that it should never wait so long, though I think around 15
or so seconds total is reasonable. Exiting before children are gone
doesn't let Apache start up any more quickly; it just prevents
potentially-useful information about timing from getting logged to the
error log.
--/--
I wouldn't complain to error log at all until it has been 2 seconds,
and then I'd still wait around for 10-15 more. But it has to check
every second so it finds out soon after all children have exited and
doesn't sleep needlessly.