Posted to dev@httpd.apache.org by Rob Hartill <ro...@imdb.com> on 1997/01/31 01:54:49 UTC

[SHOWSTOPPER] SIGHUP caused a fatal and silent crash

My old friends at LANL just reported their 1.2b6 silently died after a
SIGHUP. A SIGHUP later on worked. They'd just upgraded to 1.2b6 from a
b5-dev I think.

That's with HP-UX.

Re: [SHOWSTOPPER] SIGHUP caused a fatal and silent crash

Posted by Ed Korthof <ed...@organic.com>.

I saw this at least as early as 1.2b2; I know there have been some changes
since, but I think none which would create this behavior, since the parent
is actually waiting for each of the children.  (If it had forgotten one,
things might be fine, since it'd go on with the restart procedure; of
course, it probably couldn't bind the address, so it'd probably die outright.)

When we've seen the problem, it sufficed to kill the children manually (I
haven't tried with SIGHUP yet, but will the next chance I get; I have been
using SIGKILL); then the server restarts.  However, sending a SIGHUP to
the parent doesn't have any effect.

     -- Ed Korthof        |  Web Server Engineer --
     -- ed@organic.com    |  Organic Online, Inc --
     -- (415) 278-5676    |  Fax: (415) 284-6891 --

On Mon, 3 Feb 1997, Rob Hartill wrote:

> On Mon, 3 Feb 1997, Marc Slemko wrote:
> 
> > And FreeBSD.  What I see is all the children (a hundred or two) but
> > (generally) one as zombies waiting for the parent to collect
> > exit status.  The one process still left is going on happily
> > servicing the request it is on; I have seen it send several documents
> > on a keepalive connection, not sure if it gets any more requests after
> > that.
> 
> Mmmm, there was a recent change that affected the *number* of children.
> I wonder if the parent has lost track of some of its offspring.
> 
>

Re: [SHOWSTOPPER] SIGHUP caused a fatal and silent crash

Posted by Rob Hartill <ro...@imdb.com>.

On Mon, 3 Feb 1997, Marc Slemko wrote:

> And FreeBSD.  What I see is all the children (a hundred or two) but
> (generally) one as zombies waiting for the parent to collect
> exit status.  The one process still left is going on happily
> servicing the request it is on; I have seen it send several documents
> on a keepalive connection, not sure if it gets any more requests after
> that.

Mmmm, there was a recent change that affected the *number* of children.
I wonder if the parent has lost track of some of its offspring.

Re: [SHOWSTOPPER] SIGHUP caused a fatal and silent crash

Posted by Marc Slemko <ma...@znep.com>.

And FreeBSD.  What I see is all the children (a hundred or two) except,
generally, one sitting as zombies, waiting for the parent to collect
their exit status.  The one process still left goes on happily
servicing the request it is on; I have seen it send several documents
on a keepalive connection, but I'm not sure if it gets any more requests
after that.

My suspicion is simply that a signal is being lost somewhere.  If I
send a HUP to the child that is hanging around, it exits and
everything goes on like magic.  That means we either need to find
out why the signal is being lost or make the parent retry the signal
or, preferably, both.

I will make a guess that the parent gets stuck in:

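void reclaim_child_processes ()
{
    int i, status;
    int my_pid = getpid();

    sync_scoreboard_image();
    for (i = 0; i < HARD_SERVER_LIMIT; ++i) {
        int pid = scoreboard_image->servers[i].pid;

        /* Skip empty slots and the parent's own entry. */
        if (pid != my_pid && pid != 0)
            /* Options are 0, so this blocks until the child exits;
             * this is where I suspect the parent hangs. */
            waitpid (scoreboard_image->servers[i].pid, &status, 0);
    }
}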

...in the waitpid().  Let's say we put an alarm() before the waitpid() that
drops into a routine which sends another HUP to the children and
then tries this again.  Some sort of counter would be good so it
could send progressively stronger signals the more often it was called,
or eventually just skip that child and log a warning.

Or is WNOHANG portable enough and useful?  Could perhaps work that in
without needing an ugly signal handler.
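
To make the WNOHANG idea concrete, here is a rough sketch, not the Apache
code.  The function name reclaim_children_nohang, the plain pid array (standing
in for the scoreboard entries), the SIGHUP/SIGTERM/SIGKILL ladder, and the
100 ms x 20 polling budget are all assumptions picked for illustration.  It
polls with waitpid(..., WNOHANG) so the parent never blocks, re-sends a
progressively stronger signal each round, and logs a warning for any child it
finally gives up on.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: reap every child listed in pids[0..n-1] without
 * ever blocking in waitpid().  Reaped entries are set to 0.  Returns the
 * number of children that still had not exited after the last round. */
int reclaim_children_nohang(pid_t *pids, int n)
{
    /* Assumed escalation order: ask politely first, then insist. */
    static const int escalation[] = { SIGHUP, SIGTERM, SIGKILL };
    const int rounds = (int)(sizeof escalation / sizeof escalation[0]);
    const int polls_per_round = 20;                       /* arbitrary */
    const struct timespec poll_delay = { 0, 100 * 1000 * 1000 }; /* 100 ms */
    int remaining = 0;
    int i, r, poll;

    assert(pids != NULL || n == 0);
    for (i = 0; i < n; ++i)
        if (pids[i] > 0)
            ++remaining;

    for (r = 0; r < rounds && remaining > 0; ++r) {
        /* (Re)send this round's signal to every child still alive. */
        for (i = 0; i < n; ++i)
            if (pids[i] > 0)
                kill(pids[i], escalation[r]);

        for (poll = 0; poll < polls_per_round && remaining > 0; ++poll) {
            for (i = 0; i < n; ++i) {
                int status;
                pid_t w;

                if (pids[i] <= 0)
                    continue;
                /* WNOHANG: returns 0 at once if the child is still running. */
                w = waitpid(pids[i], &status, WNOHANG);
                if (w == pids[i] || (w == -1 && errno == ECHILD)) {
                    pids[i] = 0;    /* reaped, or not our child any more */
                    --remaining;
                }
            }
            if (remaining > 0)
                nanosleep(&poll_delay, NULL);
        }
    }

    /* Give up on whatever is left instead of hanging the restart. */
    for (i = 0; i < n; ++i)
        if (pids[i] > 0)
            fprintf(stderr, "warning: child %ld did not exit, skipping\n",
                    (long)pids[i]);
    return remaining;
}
```

In the server, the pids would presumably come from
scoreboard_image->servers[i].pid rather than a separate array.  The cost of
avoiding the signal handler is a short sleep-and-poll loop in the parent.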

On Thu, 30 Jan 1997, Dean Gaudet wrote:

> Ditto on IRIX 5.3 with 1.1.1.  I didn't bother reporting it 'cause
> graceful restart was being worked on.
> 
> Dean
> 
> On Thu, 30 Jan 1997, Ed Korthof wrote:
> 
> > While someone is looking at this -- on Solaris 2.5, on a heavily loaded
> > server, for some reason SIGHUP frequently fails to take down all the
> > children.  The server then hangs with a small number of children, and
> > waits for them to reach MaxRequestsPerChild.  Easily reproducible.
> > 
> >      -- Ed Korthof        |  Web Server Engineer --
> >      -- ed@organic.com    |  Organic Online, Inc --
> >      -- (415) 278-5676    |  Fax: (415) 284-6891 --
> > 
> > On Fri, 31 Jan 1997, Rob Hartill wrote:
> > 
> > > 
> > > My old friends at LANL just reported their 1.2b6 silently died after a
> > > SIGHUP. A SIGHUP later on worked. They'd just upgraded to 1.2b6 from a
> > > b5-dev I think.
> > > 
> > > That with HPUX.
> > > 
> > > 
> > 
> > 
>

Re: [SHOWSTOPPER] SIGHUP caused a fatal and silent crash

Posted by Dean Gaudet <dg...@arctic.org>.

Ditto on IRIX 5.3 with 1.1.1.  I didn't bother reporting it 'cause
graceful restart was being worked on.

Dean

On Thu, 30 Jan 1997, Ed Korthof wrote:

> While someone is looking at this -- on Solaris 2.5, on a heavily loaded
> server, for some reason SIGHUP frequently fails to take down all the
> children.  The server then hangs with a small number of children, and
> waits for them to reach MaxRequestsPerChild.  Easily reproducible.
> 
>      -- Ed Korthof        |  Web Server Engineer --
>      -- ed@organic.com    |  Organic Online, Inc --
>      -- (415) 278-5676    |  Fax: (415) 284-6891 --
> 
> On Fri, 31 Jan 1997, Rob Hartill wrote:
> 
> > 
> > My old friends at LANL just reported their 1.2b6 silently died after a
> > SIGHUP. A SIGHUP later on worked. They'd just upgraded to 1.2b6 from a
> > b5-dev I think.
> > 
> > That with HPUX.
> > 
> > 
> 
>

Re: [SHOWSTOPPER] SIGHUP caused a fatal and silent crash

Posted by Ed Korthof <ed...@organic.com>.

While someone is looking at this -- on Solaris 2.5, on a heavily loaded
server, for some reason SIGHUP frequently fails to take down all the
children.  The server then hangs with a small number of children, and
waits for them to reach MaxRequestsPerChild.  Easily reproducible.

     -- Ed Korthof        |  Web Server Engineer --
     -- ed@organic.com    |  Organic Online, Inc --
     -- (415) 278-5676    |  Fax: (415) 284-6891 --

On Fri, 31 Jan 1997, Rob Hartill wrote:

> 
> My old friends at LANL just reported their 1.2b6 silently died after a
> SIGHUP. A SIGHUP later on worked. They'd just upgraded to 1.2b6 from a
> b5-dev I think.
> 
> That with HPUX.
> 
>