You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@httpd.apache.org by Ben Laurie <be...@gonzo.ben.algroup.co.uk> on 1996/11/21 15:49:25 UTC

Re: SIGUSR1

Paul Richards wrote:
> 
> 
> I looked into this last night and this morning and my conclusion is
> that it's totally and utterly broken.
> 
> With the current sources I see *exactly* the same behaviour from both
> FreeBSD and Solaris. We should simply disable it for this release.
> If you accidentally SIGUSR1 a child process you get a fork storm.
> 
> I can't make any sense of the code, it looks to me like when a gracful
> restart occurs the parent will spawn off another batch of children.
> 
> A subsequent -HUP fails I think because the parent is freeing the
> socket but this only seems to happen after a USR1 and I haven't even
> started looking into why this happens so I may off track on that bit.
> 
> This is what happens:
> 
> Script started on Thu Nov 21 14:44:52 1996
> Enable tcsh features
> dpr@tees:~/httpd% ./httpd -d .
> Non-graceful restart, 2855
> Made new socket, 2855
> dpr@tees:~/httpd% ps -aefj | grep httpd
>      dpr  2856  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
>      dpr  2857  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
>      dpr  2859  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
>      dpr  2860  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
>      dpr  2855     1  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
>      dpr  2858  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
>      dpr  2862  2845  2861  2845  1 14:45:11 pts/6    0:00 grep httpd
> dpr@tees:~/httpd% kill -USR1 `cat logs/httpd.pid `
> dpr@tees:~/httpd% ps -aefj | grep httpd
>      dpr  2856  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
>      dpr  2870  2845  2869  2845  1 14:45:29 pts/6    0:00 grep httpd
>      dpr  2866  2855  2855  2855  0 14:45:27 ?        0:00 ./httpd -d .
>      dpr  2857  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
>      dpr  2859  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
>      dpr  2860  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
>      dpr  2855     1  2855  2855  1 14:45:00 ?        0:00 ./httpd -d .
>      dpr  2858  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
>      dpr  2868  2855  2855  2855  0 14:45:27 ?        0:00 ./httpd -d .
>      dpr  2865  2855  2855  2855  0 14:45:27 ?        0:00 ./httpd -d .
>      dpr  2864  2855  2855  2855  0 14:45:27 ?        0:00 ./httpd -d .
>      dpr  2867  2855  2855  2855  0 14:45:27 ?        0:00 ./httpd -d .
> 
> [ I requested a page from the server at this point which caused child
>   process 2870 to be reaped  and 2872 to be created ]
> 
> dpr@tees:~/httpd% ps -aefj | grep httpd
>      dpr  2856  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
>      dpr  2866  2855  2855  2855  0 14:45:27 ?        0:00 ./httpd -d .
>      dpr  2857  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
>      dpr  2859  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
>      dpr  2860  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
>      dpr  2855     1  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
>      dpr  2858  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
>      dpr  2868  2855  2855  2855  0 14:45:27 ?        0:00 ./httpd -d .
>      dpr  2865  2855  2855  2855  0 14:45:27 ?        0:00 ./httpd -d .
>      dpr  2864  2855  2855  2855  1 14:45:27 ?        0:00 ./httpd -d .
>      dpr  2867  2855  2855  2855  0 14:45:27 ?        0:00 ./httpd -d .
>      dpr  2872  2845  2871  2845  1 14:45:47 pts/6    0:00 grep httpd
> dpr@tees:~/httpd% kill -USR1 `cat logs/httpd.pid `
> dpr@tees:~/httpd%
> I'm dead, 2866
> I'm dead, 2856
> I'm dead, 2865
> I'm dead, 2864
> I'm dead, 2857
> I'm dead, 2858
> I'm dead, 2859
> I'm dead, 2867
> I'm dead, 2868
> I'm dead, 2860
> Non-graceful restart, 2855
> bind: Address already in use
> httpd: could not bind to port 8000
> 
> script done on Thu Nov 21 14:46:12 1996
> 
> Which I guess is because process 2855 is still bound to it since I
> don't see anywhere where it closes that socket.
> 
> Now, looking at the code, the above is what I'd expect to happen:
> 
> (http_main.c, line 1775)
> 
>     if(!one_process && !is_graceful)
>     {
> #ifndef NO_KILLPG
>       if (killpg(pgrp,SIGHUP) < 0)    /* Kill 'em off */
> #else 
>       if (kill(-pgrp,SIGHUP) < 0)
> #endif
>         log_unixerr ("killpg SIGHUP", NULL, NULL, server_conf);
>     }
>     
>     if(is_graceful)
>     {
>     log_error("SIGUSR1 received.  Doing graceful restart",server_conf);
>     kill_cleanups_for_fd(pconf,sd);
>     }
>     else if (sd != -1 || listenmaxfd != -1) {
>     reclaim_child_processes(); /* Not when just starting up */
>     log_error ("SIGHUP received.  Attempting to restart", server_conf);
>     }
> 
> At this point the config files are read and then
> 
>     
>     if (listeners == NULL)
>     {   
>         if(!is_graceful)
>         {
> printf("Non-graceful restart, %d\n", getpid());
>         memset((char *) &sa_server, 0, sizeof(sa_server));
>         sa_server.sin_family=AF_INET;
>         sa_server.sin_addr=bind_address;
>         sa_server.sin_port=htons(server_conf->port);
>         
>         sd = make_sock(pconf, &sa_server);
> printf("Made new socket, %d\n", getpid());
>         }
>     else
>         sd=saved_sd;
> 
> followed by stuff if there are listeners and then it spawns off
> another batch of children.
> 
> I can't see anywhere where children are reaped if the signal is USR1
> and the code always falls through to the point where more are created,
> the offending bit of code is
> 
>    1846         
>    1847     num_children = 0;
>    1848
>    1849     if (daemons_max_free < daemons_min_free + 1) /* Don't thrash... */
>    1850     daemons_max_free = daemons_min_free + 1;
>    1851     
>    1852     while (num_children < daemons_to_start) {
>    1853     make_child(server_conf, num_children++);
>    1854     }
> 
> I wonder if a simple
> 
>  if (!is_graceful)
> 	num_children = 0;
> 
> would help??

This would be a serious error.

> 
> I suspect that by this point the parent is seriously confused since it
> thinks it only has "daemons_to_start" children when it fact it has 
> "daemons_to_start * generations" children. I didn't follow this
> through to the scoreboard so I've no idea what that looks like in
> these circumstances.

The parent is not confused. It is just no longer interested in the "old"
children. They'll die on their own, eventually. Its the "eventually" that is
the real source of the problem, I think.

> 
> Now, I'm not sure exactly what's going on but the line
> 
> 1823     sd=saved_sd;   
> 
> is executed during a SIGUSR1 and I'm betting that this bit of code is
> what is screwing up the parent restart on a subsequent SIGHUP since
> the error suggests that the parent isn't releasing the socket (it's
> the only process still running at that point after a HUP).

Ah. That's a point. Possibly. Certainly it needs to do sd=saved_sd, (though
possibly not if the port has changed, bug?) but it may need to take some action
when subsequently SIGHUPed.

> 
> All in all, this looks very hacked together and unless someone has a
> quick fix we should disable it and take a fresh look at the whole
> forking model for 2.0, which we'd have to do anyway for threading.

I see no mileage in disabling it - if users don't like it they can not use it.
It works fine if SIGUSR1 and SIGHUP are not mixed.

Cheers,

Ben.

> 
> Incidentally, while I was wandering through the code, why
> HARD_SERVER_LIMIT? Shouldn't we know what children we have? If we did
> and we were checking internal state more consistently we'd realise
> something was going wrong.
> 
> 931     for (i = 0; i < HARD_SERVER_LIMIT; ++i) {
> 932     int pid = scoreboard_image->servers[i].pid;
> 933         
> 934     if (pid != my_pid && pid != 0)
> 935         waitpid (scoreboard_image->servers[i].pid, &status, 0);
> 936     } 
> 
> Also, we really need to sort out a style guide, the code I've included
> earlier in this message has a really odd style which I'm betting is
> Ben's and is a nightmare to follow :-) The variations make things very
> hard to follow even when they're just variations of a sane theme and
> not due to a cat jumping on the keyboard :-)
> 
> -- 
>   Paul Richards. Originative Solutions Ltd.  (Netcraft Ltd. contractor)
>   Elsevier Science TIS online journal project.
>   Email: p.richards@elsevier.co.uk
>   Phone: 0370 462071 (Mobile), +44 (0)1865 843155

-- 
Ben Laurie                Phone: +44 (181) 994 6435  Email: ben@algroup.co.uk
Freelance Consultant and  Fax:   +44 (181) 994 6472
Technical Director        URL: http://www.algroup.co.uk/Apache-SSL
A.L. Digital Ltd,         Apache Group member (http://www.apache.org)
London, England.          Apache-SSL author

Re: SIGUSR1

Posted by Paul Richards <p....@elsevier.co.uk>.
Ben Laurie <be...@gonzo.ben.algroup.co.uk> writes:

Did you really need to include my whole message? It was quite long!

> > I suspect that by this point the parent is seriously confused since it
> > thinks it only has "daemons_to_start" children when it fact it has 
> > "daemons_to_start * generations" children. I didn't follow this
> > through to the scoreboard so I've no idea what that looks like in
> > these circumstances.
> 
> The parent is not confused. It is just no longer interested in the "old"
> children. They'll die on their own, eventually. Its the "eventually" that is
> the real source of the problem, I think.

I think this is a design error. I'd prefer the parent to keep track of
processes it spawns and not "trust to luck". If it's a quiet site and
quite a few SIGUSR1's are done, which is possible if some configuration
changes are being made, then you'll end up with an awful lot of
children that may never go away (or at least hang around a long time).

If the number of child processes started is large you could trash the
system with a SIGUSR1, I didn't realise this was how graceful restart
was implemented, it's wrong.

The parent should kill off any children that aren't in use when it
receives a SIGUSR1 and only replace the one's it kills.

The children should ignore SIGUSR1.

> > Now, I'm not sure exactly what's going on but the line
> > 
> > 1823     sd=saved_sd;   
> > 
> > is executed during a SIGUSR1 and I'm betting that this bit of code is
> > what is screwing up the parent restart on a subsequent SIGHUP since
> > the error suggests that the parent isn't releasing the socket (it's
> > the only process still running at that point after a HUP).
> 
> Ah. That's a point. Possibly. Certainly it needs to do sd=saved_sd, (though
> possibly not if the port has changed, bug?) but it may need to take some action
> when subsequently SIGHUPed.
> 
> > 
> > All in all, this looks very hacked together and unless someone has a
> > quick fix we should disable it and take a fresh look at the whole
> > forking model for 2.0, which we'd have to do anyway for threading.
> 
> I see no mileage in disabling it - if users don't like it they can not use it.
> It works fine if SIGUSR1 and SIGHUP are not mixed.

I don't think so, you can trash your system by mistakenly sending a
child the signal instead of the parent and I don't think the behaviour when
the parent is sent SIGUSR1 is entirely safe either for big sites where
they spawn a large number of children or very small sites (like your
development box) where you have very few requests but you reconfigure
the server a lot.

I don't think the mixing of SIGUSR1 and SIGHUP is actually the main
concern, there's some bug with clearing the parent's socket but I think
we could probably fix that during the beta cycle. What I'm really
concerned about is the implementation of graceful restarts in general
because of the above behaviour.

-- 
  Paul Richards. Originative Solutions Ltd.  (Netcraft Ltd. contractor)
  Elsevier Science TIS online journal project.
  Email: p.richards@elsevier.co.uk
  Phone: 0370 462071 (Mobile), +44 (0)1865 843155