You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@httpd.apache.org by Justin Erenkrantz <je...@ebuilt.com> on 2001/06/02 01:52:48 UTC
Re: weird problem on Solaris 2.6 - signals / shell ??
Anybody have any thoughts on this? I haven't received any responses.
This bug is a big PITA on Solaris...
I think this patch might work (haven't tested it), but I'm not really
sure. Someone who knows the otherchild code would be able to tell
for sure. -- justin
Index: http_main.c
===================================================================
RCS file: /home/cvspublic/apache-1.3/src/main/http_main.c,v
retrieving revision 1.535
diff -u -r1.535 http_main.c
--- http_main.c 2001/04/12 17:49:26 1.535
+++ http_main.c 2001/06/01 23:50:35
@@ -2492,7 +2492,7 @@
waitret = waitpid(ocr->pid, &status, WNOHANG);
if (waitret == ocr->pid) {
ocr->pid = -1;
- (*ocr->maintenance) (OC_REASON_DEATH, ocr->data, (ap_wait_t)status);
+ (*ocr->maintenance) (OC_REASON_RESTART, ocr->data, (ap_wait_t)status);
}
else if (waitret == 0) {
(*ocr->maintenance) (OC_REASON_RESTART, ocr->data, (ap_wait_t)-1);
On Wed, May 30, 2001 at 12:15:21AM -0700, Justin Erenkrantz wrote:
> Hi gang,
>
> A few weeks ago, Greg Ames posted a problem that he saw at IBM
> regarding SIGTERM, piped logs, and Solaris (he said 2.6, but turns out
> to be seen on Solaris 2.7 here at eBuilt).
>
> For Greg's original message, see:
>
> http://www.apachelabs.org/apache-mbox/200105.mbox/%3c3B018B6C.FF0CAF0C@remulak.net%3e
>
> The problem was initially reported against IBM Web Server, but I've now
> been able to reproduce this with Apache 1.3.21dev (current).
>
> And, I think I've found the bug. At least, I think I have. This one
> isn't easy to explain. So, *please* bear with me. If you have any
> questions, email me and I'll try to explain further. I'm not exactly
> sure what is going on or what the proper long-term solution is. The
> short-term solution is to avoid using piped ErrorLog on Solaris with
> more than a few piped TransferLogs. There just seems to be some
> horrendously odd clashes between the two options causing some race
> conditions. Either one by themselves are fine, but together causes
> problems in the shutdown sequence leading to the parent httpd
> not exiting properly (and spawning a few new rotatelogs children).
>
> The problem appears to be with the initialization of the piped log
> for the ErrorLog directive *and* the shutdown process in
> reclaim_child_processes. When an ErrorLog is piped, it is opened by
> error_log_child. Notice that this is different than the
> piped_log_spawn - my gut tells me if that we make error_log_child
> similar to piped_log_spawn, this entire problem might go away. But,
> I'm not familiar enough with the other_child code in 1.3 to tell for
> certain. This also looks like fairly old code (rse in '96 and
> dgaudet in '97 based on cvs annotate).
>
> The symptoms that I see on Solaris are thus:
>
> 1) The rotatelogs process for ErrorLog directive loses its parent
> after startup. Hence, it's ppid is init. I'm not sure how this
> plays in, but this doesn't look right. I also can't recreate 2
> without having ErrorLog be a piped log.
>
> 2) On shutdown (via SIGTERM), a race condition occurs.
> reclaim_child_processes kills all of the httpd children (fine),
> but the second half of the r_c_p call is a bit odd. Basically,
> it calls piped_log_maint with OC_REASON_DEATH - this triggers
> piped_log_spawn to start up a new child since pl->program isn't
> NULL. This is completely and utterly wrong. We're supposed to
> be shutting down, not starting up. These new children never
> receive the SIGTERM - they will stick around until another
> SIGTERM occurs. The old children have already quit due to the
> SIGTERM, but the piped_log_spawn starts up new rotatelogs
> processes.
>
> Take away the ErrorLog directive - no visible problems. Don't have
> any TransferLog directives? No visible problem. I think that the
> ErrorLog code is broken and we're hitting some sort of signal race
> condition in reclaim_child_processes for the TransferLog. How are
> these interacting? I haven't a foggy clue.
>
> Adding:
>
> ErrorLog "| /foo/apache-1.3/bin/rotatelogs /foo/apache-1.3/logs/piped_error_log 86400"
>
> causes the problem on Solaris. You'll probably need a few TransferLog
> directives to cause the inadvertant fork(). When you send SIGTERM to
> the parent httpd process, new rotatelogs are created and hence the
> parent will never quit (these new children didn't receive the SIGTERM).
> Ooooops.
>
> I've attached the following things to maybe help others understand:
>
> 1) Truss output of parent httpd process receiving SIGTERM.
> Notice the fork within the waitpid calls. I later killed off
> the rotatelogs and then sent SIGTERM again (it terminated).
> 2) gdb stack trace of reclaim_child_processes
> 3) ps listing showing that there is a rotatelogs with 1 as its
> parent.
>
> Am I right in my analysis? Does this make any sense? What is the
> proper fix to this? Can anyone reproduce this on non-Solaris
> machines? -- justin
> poll(0xFFBEED48, 0, 1000) = 0
> time() = 991198040
> poll(0xFFBEEB98, 7, 0) = 7
> waitid(P_ALL, 0, 0xFFBEECF0, WEXITED|WTRAPPED|WNOHANG) = 0
> poll(0xFFBEED48, 0, 1000) = 0
> time() = 991198041
> waitid(P_ALL, 0, 0xFFBEECF0, WEXITED|WTRAPPED|WNOHANG) = 0
> poll(0xFFBEED48, 0, 1000) = 0
> time() = 991198042
> waitid(P_ALL, 0, 0xFFBEECF0, WEXITED|WTRAPPED|WNOHANG) = 0
> Received signal #15, SIGTERM, in poll() [caught]
> siginfo: SIGTERM pid=12483 uid=7812
> poll(0xFFBEED48, 0, 1000) Err#4 EINTR
> setcontext(0xFFBEEA30)
> time() = 991198043
> kill(-12454, SIGTERM) = 0
> Received signal #15, SIGTERM [caught]
> siginfo: SIGTERM pid=12454 uid=7812
> setcontext(0xFFBEEB38)
> poll(0xFFBEED28, 0, 17) = 0
> waitid(P_PID, 12473, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12474, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12475, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12476, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12477, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12471, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12471, SIGTERM) = 0
> waitid(P_PID, 12469, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12469, SIGTERM) = 0
> waitid(P_PID, 12467, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12467, SIGTERM) = 0
> waitid(P_PID, 12465, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> fork() = 12485
> waitid(P_PID, 12463, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> fork() = 12487
> waitid(P_PID, 12461, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> fork() = 12489
> waitid(P_PID, 12459, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> fork() = 12491
> poll(0xFFBEED28, 0, 66) = 0
> waitid(P_PID, 12475, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12476, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12477, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12491, SIGTERM) = 0
> waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12489, SIGTERM) = 0
> waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12487, SIGTERM) = 0
> waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12485, SIGTERM) = 0
> waitid(P_PID, 12471, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12469, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12467, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> poll(0xFFBEED28, 0, 263) = 0
> waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12491, SIGTERM) = 0
> waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12489, SIGTERM) = 0
> waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12487, SIGTERM) = 0
> waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12485, SIGTERM) = 0
> poll(0xFFBEED28, 0, 1049) (sleeping...)
> poll(0xFFBEED28, 0, 1049) = 0
> waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12491, SIGTERM) = 0
> waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12489, SIGTERM) = 0
> waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12487, SIGTERM) = 0
> waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12485, SIGTERM) = 0
> poll(0xFFBEED28, 0, 4195) (sleeping...)
> poll(0xFFBEED28, 0, 4195) = 0
> waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12491, SIGTERM) = 0
> waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12489, SIGTERM) = 0
> waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12487, SIGTERM) = 0
> waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12485, SIGTERM) = 0
> poll(0xFFBEED28, 0, 16778) (sleeping...)
> Received signal #15, SIGTERM, in poll() [caught]
> siginfo: SIGTERM pid=12505 uid=7812
> poll(0xFFBEED28, 0, 16778) Err#4 EINTR
> setcontext(0xFFBEEA10)
> waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12491, SIGTERM) = 0
> waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12489, SIGTERM) = 0
> waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12487, SIGTERM) = 0
> waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12485, SIGTERM) = 0
> poll(0xFFBEED28, 0, 67109) (sleeping...)
> Received signal #15, SIGTERM, in poll() [caught]
> siginfo: SIGTERM pid=27531 uid=7812
> poll(0xFFBEED28, 0, 67109) Err#4 EINTR
> setcontext(0xFFBEEA10)
> waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12491, SIGTERM) = 0
> waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12489, SIGTERM) = 0
> waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12487, SIGTERM) = 0
> waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12485, SIGTERM) = 0
> poll(0xFFBEED28, 0, 268436) (sleeping...)
> poll(0xFFBEED28, 0, 268436) = 0
> waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> unlink("/home/jerenkrantz/apache-1.3/logs/httpd.pid") = 0
> getpid() = 12454 [1]
> time() = 991198337
> write(5, " [ T u e M a y 2 9 ".., 66) Err#32 EPIPE
> Received signal #13, SIGPIPE [ignored]
> close(5) = 0
> waitid(P_PID, 12457, 0xFFBEEAC0, WEXITED|WTRAPPED|WNOHANG) = 0
> close(21) = 0
> close(20) = 0
> close(18) = 0
> close(19) = 0
> close(16) = 0
> close(17) = 0
> close(13) = 0
> close(14) = 0
> close(11) = 0
> close(12) = 0
> close(9) = 0
> close(10) = 0
> close(7) = 0
> close(8) = 0
> close(3) = 0
> close(6) = 0
> close(15) = 0
> llseek(0, 0, SEEK_CUR) = 0
> _exit(0)
> jerenkra 14112 14111 0 22:48:50 ? 0:00 /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
> jerenkra 14115 14113 0 22:48:50 ? 0:00 /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
> jerenkra 14118 14117 0 22:48:50 ? 0:00 /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
> jerenkra 14110 14108 0 22:48:50 ? 0:00 /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
> jerenkra 14111 14097 0 22:48:50 ? 0:00 /bin/sh -c /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
> jerenkra 14103 14102 0 22:48:50 ? 0:00 /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
> jerenkra 14107 14106 0 22:48:50 ? 0:00 /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
> jerenkra 14106 14097 0 22:48:50 ? 0:00 /bin/sh -c /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
> jerenkra 14108 14097 0 22:48:50 ? 0:00 /bin/sh -c /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
> jerenkra 14104 14097 0 22:48:50 ? 0:00 /bin/sh -c /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
> jerenkra 14114 14097 0 22:48:50 ? 0:00 /bin/sh -c /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
> jerenkra 14127 13768 0 22:49:20 pts/5 0:00 grep rotate
> jerenkra 14116 14114 0 22:48:50 ? 0:00 /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
> jerenkra 14117 14097 0 22:48:50 ? 0:00 /bin/sh -c /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
> jerenkra 14113 14097 0 22:48:50 ? 0:00 /bin/sh -c /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
> jerenkra 14082 1 0 22:48:47 pts/5 0:00 /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
> jerenkra 14102 14097 0 22:48:50 ? 0:00 /bin/sh -c /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
> jerenkra 14105 14104 0 22:48:50 ? 0:00 /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
> #0 piped_log_spawn (pl=0xce420) at http_log.c:611
> #1 0x4c190 in piped_log_maintenance (reason=0, data=0xce420, status=13) at http_log.c:658
> #2 0x4e85c in reclaim_child_processes (terminate=1) at http_main.c:2495
> #3 0x51bd8 in standalone_main (argc=1, argv=0xffbeefe4) at http_main.c:4779
> #4 0x52484 in main (argc=1, argv=0xffbeefe4) at http_main.c:5010
Re: weird problem on Solaris 2.6 - signals / shell ??
Posted by Greg Ames <gr...@remulak.net>.
Justin Erenkrantz wrote:
>
> Anybody have any thoughts on this? I haven't received any responses.
> This bug is a big PITA on Solaris...
>
ok, I'm starting to catch up with you on this. Thanks a bunch for all
the digging you did :-)
> I think this patch might work (haven't tested it), but I'm not really
> sure.
I will test this...it looks quite reasonable after reading your analysis
and some of the 1.3 code. Unfortunately I don't have a Solaris 2.7
account, but if it fixes Solaris 2.6 and 8, it ought to make the 2.7
customer happy.
> Someone who knows the otherchild code would be able to tell
> for sure. -- justin
1.3 otherchild didn't look too bad when I plowed thru it earlier.
However, it seems like there was a misunderstanding of when
to use the various OC_REASON_blah's.
>
> Index: http_main.c
> ===================================================================
> RCS file: /home/cvspublic/apache-1.3/src/main/http_main.c,v
> retrieving revision 1.535
> diff -u -r1.535 http_main.c
> --- http_main.c 2001/04/12 17:49:26 1.535
> +++ http_main.c 2001/06/01 23:50:35
> @@ -2492,7 +2492,7 @@
> waitret = waitpid(ocr->pid, &status, WNOHANG);
> if (waitret == ocr->pid) {
> ocr->pid = -1;
> - (*ocr->maintenance) (OC_REASON_DEATH, ocr->data, (ap_wait_t)status);
> + (*ocr->maintenance) (OC_REASON_RESTART, ocr->data, (ap_wait_t)status);
> }
> else if (waitret == 0) {
> (*ocr->maintenance) (OC_REASON_RESTART, ocr->data, (ap_wait_t)-1);
>
> >
> > The symptoms that I see on Solaris are thus:
> >
> > 1) The rotatelogs process for ErrorLog directive loses its parent
> > after startup. Hence, it's ppid is init. I'm not sure how this
> > plays in, but this doesn't look right. I also can't recreate 2
> > without having ErrorLog be a piped log.
> >
that seems bogus all right, but I think your fix for 2) will cure our
immediate pain.
> > 2) On shutdown (via SIGTERM), a race condition occurs.
> > reclaim_child_processes kills all of the httpd children (fine),
> > but the second half of the r_c_p call is a bit odd. Basically,
> > it calls piped_log_maint with OC_REASON_DEATH - this triggers
> > piped_log_spawn to start up a new child since pl->program isn't
> > NULL. This is completely and utterly wrong. We're supposed to
> > be shutting down, not starting up. These new children never
> > receive the SIGTERM - they will stick around until another
> > SIGTERM occurs. The old children have already quit due to the
> > SIGTERM, but the piped_log_spawn starts up new rotatelogs
> > processes.
> >
Re: weird problem on Solaris 2.6 - signals / shell ??
Posted by Greg Ames <gr...@remulak.net>.
Justin Erenkrantz wrote:
>
> Anybody have any thoughts on this? I haven't received any responses.
> This bug is a big PITA on Solaris...
>
> I think this patch might work (haven't tested it), but I'm not really
> sure. Someone who knows the otherchild code would be able to tell
> for sure. -- justin
>
I tested it - works like a charm - and committed it. Thanks much,
Justin!