You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@httpd.apache.org by Justin Erenkrantz <je...@ebuilt.com> on 2001/06/02 01:52:48 UTC

Re: weird problem on Solaris 2.6 - signals / shell ??

Anybody have any thoughts on this?  I haven't received any responses.
This bug is a big PITA on Solaris...

I think this patch might work (haven't tested it), but I'm not really 
sure.  Someone who knows the otherchild code would be able to tell 
for sure.  -- justin

Index: http_main.c
===================================================================
RCS file: /home/cvspublic/apache-1.3/src/main/http_main.c,v
retrieving revision 1.535
diff -u -r1.535 http_main.c
--- http_main.c	2001/04/12 17:49:26	1.535
+++ http_main.c	2001/06/01 23:50:35
@@ -2492,7 +2492,7 @@
 	    waitret = waitpid(ocr->pid, &status, WNOHANG);
 	    if (waitret == ocr->pid) {
 		ocr->pid = -1;
-		(*ocr->maintenance) (OC_REASON_DEATH, ocr->data, (ap_wait_t)status);
+		(*ocr->maintenance) (OC_REASON_RESTART, ocr->data, (ap_wait_t)status);
 	    }
 	    else if (waitret == 0) {
 		(*ocr->maintenance) (OC_REASON_RESTART, ocr->data, (ap_wait_t)-1);

On Wed, May 30, 2001 at 12:15:21AM -0700, Justin Erenkrantz wrote:
> Hi gang,
> 
> A few weeks ago, Greg Ames posted a problem that he saw at IBM
> regarding SIGTERM, piped logs, and Solaris (he said 2.6, but turns out
> to be seen on Solaris 2.7 here at eBuilt).  
> 
> For Greg's original message, see:
> 
> http://www.apachelabs.org/apache-mbox/200105.mbox/%3c3B018B6C.FF0CAF0C@remulak.net%3e
> 
> The problem was initially reported against IBM Web Server, but I've now 
> been able to reproduce this with Apache 1.3.21dev (current).
> 
> And, I think I've found the bug.  At least, I think I have.  This one
> isn't easy to explain.  So, *please* bear with me.  If you have any
> questions, email me and I'll try to explain further.  I'm not exactly
> sure what is going on or what the proper long-term solution is.  The
> short-term solution is to avoid using piped ErrorLog on Solaris with
> more than a few piped TransferLogs.  There just seems to be some 
> horrendously odd clashes between the two options causing some race
> conditions.  Either one by themselves are fine, but together causes 
> problems in the shutdown sequence leading to the parent httpd
> not exiting properly (and spawning a few new rotatelogs children).
> 
> The problem appears to be with the initialization of the piped log 
> for the ErrorLog directive *and* the shutdown process in
> reclaim_child_processes.  When an ErrorLog is piped, it is opened by
> error_log_child.  Notice that this is different than the
> piped_log_spawn - my gut tells me if that we make error_log_child
> similar to piped_log_spawn, this entire problem might go away.  But, 
> I'm not familiar enough with the other_child code in 1.3 to tell for
> certain.  This also looks like fairly old code (rse in '96 and
> dgaudet in '97 based on cvs annotate).
> 
> The symptoms that I see on Solaris are thus:
> 
> 1) The rotatelogs process for ErrorLog directive loses its parent
>    after startup.  Hence, it's ppid is init.  I'm not sure how this
>    plays in, but this doesn't look right.  I also can't recreate 2
>    without having ErrorLog be a piped log.
> 
> 2) On shutdown (via SIGTERM), a race condition occurs.  
>    reclaim_child_processes kills all of the httpd children (fine),
>    but the second half of the r_c_p call is a bit odd.  Basically,
>    it calls piped_log_maint with OC_REASON_DEATH - this triggers
>    piped_log_spawn to start up a new child since pl->program isn't 
>    NULL.  This is completely and utterly wrong.  We're supposed to
>    be shutting down, not starting up.  These new children never
>    receive the SIGTERM - they will stick around until another
>    SIGTERM occurs.  The old children have already quit due to the
>    SIGTERM, but the piped_log_spawn starts up new rotatelogs 
>    processes.
> 
> Take away the ErrorLog directive - no visible problems.  Don't have 
> any TransferLog directives?  No visible problem.  I think that the 
> ErrorLog code is broken and we're hitting some sort of signal race 
> condition in reclaim_child_processes for the TransferLog.  How are 
> these interacting?  I haven't a foggy clue.
> 
> Adding:
> 
> ErrorLog "| /foo/apache-1.3/bin/rotatelogs /foo/apache-1.3/logs/piped_error_log 86400"
> 
> causes the problem on Solaris.  You'll probably need a few TransferLog
> directives to cause the inadvertant fork().  When you send SIGTERM to
> the parent httpd process, new rotatelogs are created and hence the
> parent will never quit (these new children didn't receive the SIGTERM).
> Ooooops.
> 
> I've attached the following things to maybe help others understand:
> 
> 1) Truss output of parent httpd process receiving SIGTERM.
>    Notice the fork within the waitpid calls.  I later killed off
>    the rotatelogs and then sent SIGTERM again (it terminated).
> 2) gdb stack trace of reclaim_child_processes
> 3) ps listing showing that there is a rotatelogs with 1 as its
>    parent.
> 
> Am I right in my analysis?  Does this make any sense?  What is the 
> proper fix to this?  Can anyone reproduce this on non-Solaris
> machines?  -- justin

> poll(0xFFBEED48, 0, 1000)			= 0
> time()						= 991198040
> poll(0xFFBEEB98, 7, 0)				= 7
> waitid(P_ALL, 0, 0xFFBEECF0, WEXITED|WTRAPPED|WNOHANG) = 0
> poll(0xFFBEED48, 0, 1000)			= 0
> time()						= 991198041
> waitid(P_ALL, 0, 0xFFBEECF0, WEXITED|WTRAPPED|WNOHANG) = 0
> poll(0xFFBEED48, 0, 1000)			= 0
> time()						= 991198042
> waitid(P_ALL, 0, 0xFFBEECF0, WEXITED|WTRAPPED|WNOHANG) = 0
>     Received signal #15, SIGTERM, in poll() [caught]
>       siginfo: SIGTERM pid=12483 uid=7812
> poll(0xFFBEED48, 0, 1000)			Err#4 EINTR
> setcontext(0xFFBEEA30)
> time()						= 991198043
> kill(-12454, SIGTERM)				= 0
>     Received signal #15, SIGTERM [caught]
>       siginfo: SIGTERM pid=12454 uid=7812
> setcontext(0xFFBEEB38)
> poll(0xFFBEED28, 0, 17)				= 0
> waitid(P_PID, 12473, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12474, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12475, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12476, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12477, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12471, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12471, SIGTERM)				= 0
> waitid(P_PID, 12469, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12469, SIGTERM)				= 0
> waitid(P_PID, 12467, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12467, SIGTERM)				= 0
> waitid(P_PID, 12465, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> fork()						= 12485
> waitid(P_PID, 12463, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> fork()						= 12487
> waitid(P_PID, 12461, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> fork()						= 12489
> waitid(P_PID, 12459, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> fork()						= 12491
> poll(0xFFBEED28, 0, 66)				= 0
> waitid(P_PID, 12475, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12476, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12477, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12491, SIGTERM)				= 0
> waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12489, SIGTERM)				= 0
> waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12487, SIGTERM)				= 0
> waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12485, SIGTERM)				= 0
> waitid(P_PID, 12471, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12469, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12467, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> poll(0xFFBEED28, 0, 263)			= 0
> waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12491, SIGTERM)				= 0
> waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12489, SIGTERM)				= 0
> waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12487, SIGTERM)				= 0
> waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12485, SIGTERM)				= 0
> poll(0xFFBEED28, 0, 1049)	(sleeping...)
> poll(0xFFBEED28, 0, 1049)			= 0
> waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12491, SIGTERM)				= 0
> waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12489, SIGTERM)				= 0
> waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12487, SIGTERM)				= 0
> waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12485, SIGTERM)				= 0
> poll(0xFFBEED28, 0, 4195)	(sleeping...)
> poll(0xFFBEED28, 0, 4195)			= 0
> waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12491, SIGTERM)				= 0
> waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12489, SIGTERM)				= 0
> waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12487, SIGTERM)				= 0
> waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12485, SIGTERM)				= 0
> poll(0xFFBEED28, 0, 16778)	(sleeping...)
>     Received signal #15, SIGTERM, in poll() [caught]
>       siginfo: SIGTERM pid=12505 uid=7812
> poll(0xFFBEED28, 0, 16778)			Err#4 EINTR
> setcontext(0xFFBEEA10)
> waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12491, SIGTERM)				= 0
> waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12489, SIGTERM)				= 0
> waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12487, SIGTERM)				= 0
> waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12485, SIGTERM)				= 0
> poll(0xFFBEED28, 0, 67109)	(sleeping...)
>     Received signal #15, SIGTERM, in poll() [caught]
>       siginfo: SIGTERM pid=27531 uid=7812
> poll(0xFFBEED28, 0, 67109)			Err#4 EINTR
> setcontext(0xFFBEEA10)
> waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12491, SIGTERM)				= 0
> waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12489, SIGTERM)				= 0
> waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12487, SIGTERM)				= 0
> waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12485, SIGTERM)				= 0
> poll(0xFFBEED28, 0, 268436)	(sleeping...)
> poll(0xFFBEED28, 0, 268436)			= 0
> waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> unlink("/home/jerenkrantz/apache-1.3/logs/httpd.pid") = 0
> getpid()					= 12454 [1]
> time()						= 991198337
> write(5, " [ T u e   M a y   2 9  ".., 66)	Err#32 EPIPE
>     Received signal #13, SIGPIPE [ignored]
> close(5)					= 0
> waitid(P_PID, 12457, 0xFFBEEAC0, WEXITED|WTRAPPED|WNOHANG) = 0
> close(21)					= 0
> close(20)					= 0
> close(18)					= 0
> close(19)					= 0
> close(16)					= 0
> close(17)					= 0
> close(13)					= 0
> close(14)					= 0
> close(11)					= 0
> close(12)					= 0
> close(9)					= 0
> close(10)					= 0
> close(7)					= 0
> close(8)					= 0
> close(3)					= 0
> close(6)					= 0
> close(15)					= 0
> llseek(0, 0, SEEK_CUR)				= 0
> _exit(0)

> jerenkra 14112 14111  0 22:48:50 ?        0:00 /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
> jerenkra 14115 14113  0 22:48:50 ?        0:00 /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
> jerenkra 14118 14117  0 22:48:50 ?        0:00 /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
> jerenkra 14110 14108  0 22:48:50 ?        0:00 /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
> jerenkra 14111 14097  0 22:48:50 ?        0:00 /bin/sh -c  /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
> jerenkra 14103 14102  0 22:48:50 ?        0:00 /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
> jerenkra 14107 14106  0 22:48:50 ?        0:00 /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
> jerenkra 14106 14097  0 22:48:50 ?        0:00 /bin/sh -c  /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
> jerenkra 14108 14097  0 22:48:50 ?        0:00 /bin/sh -c  /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
> jerenkra 14104 14097  0 22:48:50 ?        0:00 /bin/sh -c  /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
> jerenkra 14114 14097  0 22:48:50 ?        0:00 /bin/sh -c  /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
> jerenkra 14127 13768  0 22:49:20 pts/5    0:00 grep rotate
> jerenkra 14116 14114  0 22:48:50 ?        0:00 /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
> jerenkra 14117 14097  0 22:48:50 ?        0:00 /bin/sh -c  /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
> jerenkra 14113 14097  0 22:48:50 ?        0:00 /bin/sh -c  /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
> jerenkra 14082     1  0 22:48:47 pts/5    0:00 /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
> jerenkra 14102 14097  0 22:48:50 ?        0:00 /bin/sh -c  /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
> jerenkra 14105 14104  0 22:48:50 ?        0:00 /home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i

> #0  piped_log_spawn (pl=0xce420) at http_log.c:611
> #1  0x4c190 in piped_log_maintenance (reason=0, data=0xce420, status=13) at http_log.c:658
> #2  0x4e85c in reclaim_child_processes (terminate=1) at http_main.c:2495
> #3  0x51bd8 in standalone_main (argc=1, argv=0xffbeefe4) at http_main.c:4779
> #4  0x52484 in main (argc=1, argv=0xffbeefe4) at http_main.c:5010


Re: weird problem on Solaris 2.6 - signals / shell ??

Posted by Greg Ames <gr...@remulak.net>.
Justin Erenkrantz wrote:
> 
> Anybody have any thoughts on this?  I haven't received any responses.
> This bug is a big PITA on Solaris...
> 

ok, I'm starting to catch up with you on this.  Thanks a bunch for all
the digging you did :-)

> I think this patch might work (haven't tested it), but I'm not really
> sure.  

I will test this...it looks quite reasonable after reading your analysis
and some of the 1.3 code.  Unfortunately I don't have a Solaris 2.7 
account, but if it fixes Solaris 2.6 and 8, it ought to make the 2.7 
customer happy.  

>          Someone who knows the otherchild code would be able to tell
> for sure.  -- justin

1.3 otherchild didn't look too bad when I plowed thru it earlier.  
However, it seems like there was a misunderstanding of when 
to use the various OC_REASON_blah's.

> 
> Index: http_main.c
> ===================================================================
> RCS file: /home/cvspublic/apache-1.3/src/main/http_main.c,v
> retrieving revision 1.535
> diff -u -r1.535 http_main.c
> --- http_main.c 2001/04/12 17:49:26     1.535
> +++ http_main.c 2001/06/01 23:50:35
> @@ -2492,7 +2492,7 @@
>             waitret = waitpid(ocr->pid, &status, WNOHANG);
>             if (waitret == ocr->pid) {
>                 ocr->pid = -1;
> -               (*ocr->maintenance) (OC_REASON_DEATH, ocr->data, (ap_wait_t)status);
> +               (*ocr->maintenance) (OC_REASON_RESTART, ocr->data, (ap_wait_t)status);
>             }
>             else if (waitret == 0) {
>                 (*ocr->maintenance) (OC_REASON_RESTART, ocr->data, (ap_wait_t)-1);
> 

> >
> > The symptoms that I see on Solaris are thus:
> >
> > 1) The rotatelogs process for ErrorLog directive loses its parent
> >    after startup.  Hence, it's ppid is init.  I'm not sure how this
> >    plays in, but this doesn't look right.  I also can't recreate 2
> >    without having ErrorLog be a piped log.
> >

that seems bogus all right, but I think your fix for 2) will cure our 
immediate pain.

> > 2) On shutdown (via SIGTERM), a race condition occurs.
> >    reclaim_child_processes kills all of the httpd children (fine),
> >    but the second half of the r_c_p call is a bit odd.  Basically,
> >    it calls piped_log_maint with OC_REASON_DEATH - this triggers
> >    piped_log_spawn to start up a new child since pl->program isn't
> >    NULL.  This is completely and utterly wrong.  We're supposed to
> >    be shutting down, not starting up.  These new children never
> >    receive the SIGTERM - they will stick around until another
> >    SIGTERM occurs.  The old children have already quit due to the
> >    SIGTERM, but the piped_log_spawn starts up new rotatelogs
> >    processes.
> >

Re: weird problem on Solaris 2.6 - signals / shell ??

Posted by Greg Ames <gr...@remulak.net>.
Justin Erenkrantz wrote:
> 
> Anybody have any thoughts on this?  I haven't received any responses.
> This bug is a big PITA on Solaris...
> 
> I think this patch might work (haven't tested it), but I'm not really
> sure.  Someone who knows the otherchild code would be able to tell
> for sure.  -- justin
> 

I tested it - works like a charm - and committed it.  Thanks much,
Justin!