You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@httpd.apache.org by "William A. Rowe, Jr." <wr...@rowe-clan.net> on 2009/01/06 19:10:25 UTC

Graceful restart not so graceful?

Would folks comment on Nathan's, Joe's and Stefan's work on

https://issues.apache.org/bugzilla/show_bug.cgi?id=42829

and offer any comments on why this patch;

https://issues.apache.org/bugzilla/attachment.cgi?id=22822

should not be applied to trunk/ and branches/2.2.x/ in time for
the next releases?

If I hear no objections by noon Jan 8 I'll apply.  According to
Stefan this current patch is improving the situation.

Bill


Re: Graceful restart not so graceful?

Posted by Jeff Trawick <tr...@gmail.com>.
On Tue, Jan 6, 2009 at 1:10 PM, William A. Rowe, Jr. <wr...@rowe-clan.net>wrote:

> Would folks comment on Nathan's, Joe's and Stefan's work on
>
> https://issues.apache.org/bugzilla/show_bug.cgi?id=42829
>
> and offer any comments on why this patch;
>
> https://issues.apache.org/bugzilla/attachment.cgi?id=22822
>
> should not be applied to trunk/ and branches/2.2.x/ in time for
> the next releases?
>
> If I hear no objections by noon Jan 8 I'll apply.  According to
> Stefan this current patch is improving the situation.
>
> Bill
>
> See also http://marc.info/?l=apache-httpd-dev&m=119945416529706&w=2

Re: Graceful restart not so graceful?

Posted by Ruediger Pluem <rp...@apache.org>.

On 01/06/2009 07:10 PM, William A. Rowe, Jr. wrote:
> Would folks comment on Nathan's, Joe's and Stefan's work on
> 
> https://issues.apache.org/bugzilla/show_bug.cgi?id=42829
> 
> and offer any comments on why this patch;
> 
> https://issues.apache.org/bugzilla/attachment.cgi?id=22822
> 
> should not be applied to trunk/ and branches/2.2.x/ in time for
> the next releases?

@@ -573,6 +588,11 @@ static void child_main(int child_num_arg
                 apr_int32_t numdesc;
                 const apr_pollfd_t *pdesc;

+                if (die_now) {
+                    status = !APR_SUCCESS;
+                    goto unlock;
+                }
+

Hm, what happens if we get the signal exactly here?
Then stop_listening only set die_now to 1. So we reenter the poll and stay in
until we get a new request to OUR process on one of the listeners.
Is this the desired behaviour?

                 /* timeout == -1 == wait forever */
                 status = apr_pollset_poll(pollset, -1, &numdesc, &pdesc);
                 if (status != APR_SUCCESS) {

How about this one instead:

Index: server/mpm/prefork/prefork.c
===================================================================
--- server/mpm/prefork/prefork.c        (Revision 731601)
+++ server/mpm/prefork/prefork.c        (Arbeitskopie)
@@ -99,6 +99,8 @@
 static int mpm_state = AP_MPMQ_STARTING;
 static ap_pod_t *pod;

+static int dummy_listener;
+
 /*
  * The max child slot ever assigned, preserved across restarts.  Necessary
  * to deal with MaxClients changes across AP_SIG_GRACEFUL restarts.  We
@@ -136,6 +138,8 @@
 #endif /* TPF */

 static volatile int die_now = 0;
+static volatile int listeners_closed = 0;
+static int close_connection = 1;

 #ifdef GPROF
 /*
@@ -304,8 +308,19 @@

 static void stop_listening(int sig)
 {
-    ap_close_listeners();
+    if (close_connection && !listeners_closed) {
+        ap_listen_rec *lr;

+        for (lr = ap_listeners; lr; lr = lr->next) {
+            apr_os_sock_t fd;
+
+            apr_os_sock_get(&fd, lr->sd);
+
+            dup2(dummy_listener, fd);
+        }
+        listeners_closed = 1;
+    }
+
     /* For a graceful stop, we want the child to exit when done */
     die_now = 1;
 }
@@ -533,6 +548,7 @@
         if (num_listensocks == 1) {
             /* There is only one listener record, so refer to that one. */
             lr = ap_listeners;
+            close_connection = 0;
         }
         else {
             /* multiple listening sockets - need to poll */
@@ -540,8 +556,14 @@
                 apr_int32_t numdesc;
                 const apr_pollfd_t *pdesc;

+                if (die_now) {
+                    status = !APR_SUCCESS;
+                    goto unlock;
+                }
+
                 /* timeout == -1 == wait forever */
                 status = apr_pollset_poll(pollset, -1, &numdesc, &pdesc);
+                close_connection = 0;
                 if (status != APR_SUCCESS) {
                     if (APR_STATUS_IS_EINTR(status)) {
                         if (one_process && shutdown_pending) {
@@ -595,8 +617,15 @@
         /* if we accept() something we don't want to die, so we have to
          * defer the exit
          */
-        status = lr->accept_func(&csd, lr, ptrans);
+        if (!die_now) {
+            status = lr->accept_func(&csd, lr, ptrans);
+        }
+        else {
+            status = !APR_SUCCESS;
+        }

+    unlock:
+        close_connection = 1;
         SAFE_ACCEPT(accept_mutex_off());      /* unlock after "accept" */

         if (status == APR_EGENERAL) {
@@ -612,6 +641,11 @@
          * socket options, file descriptors, and read/write buffers.
          */

+        if (die_now && !listeners_closed) {
+            ap_close_listeners();
+            listeners_closed = 1;
+        }
+
         current_conn = ap_run_create_connection(ptrans, ap_server_conf, csd, my_child_num, sbh, bucket_alloc);
         if (current_conn) {
             ap_process_connection(current_conn, csd);
@@ -890,6 +924,17 @@
         mpm_state = AP_MPMQ_STOPPING;
         return 1;
     }
+
+    if (dummy_listener == 0) {
+        dummy_listener = socket(AF_INET, SOCK_STREAM, 0);
+        if (dummy_listener < 0) {
+            rv = errno;
+            ap_log_error(APLOG_MARK, APLOG_EMERG, rv, s,
+                         "Couldn't create dummy listener socket");
+            mpm_state = AP_MPMQ_STOPPING;
+            return 1;
+        }
+    }

 #if APR_USE_SYSVSEM_SERIALIZE
     if (ap_accept_lock_mech == APR_LOCK_DEFAULT ||



Regards

Rüdiger


Re: Graceful restart not so graceful?

Posted by "Plüm, Rüdiger, VF-Group" <ru...@vodafone.com>.
 

> -----Ursprüngliche Nachricht-----
> Von: Jeff Trawick [mailto:trawick@gmail.com] 
> Gesendet: Mittwoch, 7. Januar 2009 13:03
> An: dev@httpd.apache.org
> Betreff: Re: Graceful restart not so graceful?
> 
> On Wed, Jan 7, 2009 at 6:03 AM, Joe Orton <jo...@redhat.com> wrote:
> 
> 
> 	On Tue, Jan 06, 2009 at 12:10:25PM -0600, William Rowe wrote:
> 	> Would folks comment on Nathan's, Joe's and Stefan's work on
> 	>
> 	> https://issues.apache.org/bugzilla/show_bug.cgi?id=42829
> 	>
> 	> and offer any comments on why this patch;
> 	>
> 	> https://issues.apache.org/bugzilla/attachment.cgi?id=22822
> 	>
> 	> should not be applied to trunk/ and branches/2.2.x/ 
> in time for
> 	> the next releases?
> 	
> 	
> 	I never found the time to commit this for a bunch of reasons:
> 	
> 	1) I never convinced myself that adding a bunch of complexity to
> 	prefork, per Stefan's patch, was the right way to address this
> 	
> 	2) I never worked out what impact the lack of check on the poll
> 	descriptor event type (POLLERR etc) had on this, but that needs
> 	resolving too
> 	
> 	3) I hadn't checked whether worker et al had similar issues
> 	
> 	4) I think that a simpler solution to this problem 
> would be to add a
> 	timeout (e.g. 30s) on the child pollset_poll() call so they are
> 	guaranteed to pop at some point, but I haven't had time 
> to try this or
> 	work out whether that will suck in some other way.
> 
> 
> Initial testing of your idea for a timeout was promising.  
> I'll try to test it with graceful stop as well.
> 
> Index: server/mpm/prefork/prefork.c
> ============================== 
> =====================================
> --- server/mpm/prefork/prefork.c    (revision 731724)
> +++ server/mpm/prefork/prefork.c    (working copy)
> @@ -540,10 +540,12 @@
>                  apr_int32_t numdesc;
>                  const apr_pollfd_t *pdesc;
>  
> -                /* timeout == -1 == wait forever */
> -                status = apr_pollset_poll(pollset, -1, 
> &numdesc, &pdesc);
> +                /* timeout == 10 seconds to avoid a hang at 
> graceful restart/stop
> +                 * caused by the closing of sockets by the 
> signal handler
> +                 */
> +                status = apr_pollset_poll(pollset, 
> apr_time_from_sec(10), &numdesc, &pdesc);
>                  if (status != APR_SUCCESS) {
> -                    if (APR_STATUS_IS_EINTR(status)) {
> +                    if (APR_STATUS_IS_TIMEUP(status) || 
> APR_STATUS_IS_EINTR(status)) {
>                          if (one_process && shutdown_pending) {
>                              return;
>                          }

Don't we need to add

SAFE_ACCEPT(accept_mutex_off());

before the continue later on in the 

if (APR_STATUS_IS_TIMEUP(status) || APR_STATUS_IS_EINTR(status)) {

block or is it guaranteed that if we allready have the lock
that another SAFE_ACCEPT(accept_mutex_on()); doesn't do any harm?

Regards

Rüdiger


Re: Graceful restart not so graceful?

Posted by Stefan Fritsch <sf...@sfritsch.de>.
Hi,

thanks for following up on this and sorry for the late response.

On Wednesday 07 January 2009, Jeff Trawick wrote:
> Initial testing of your idea for a timeout was promising.

I couldn't reproduce any hangs under linux with the patch you commited 
to trunk.

In my patch I tried to avoid that an idle apache wakes periodically 
for no good reasons. But if you don't think that is a problem, please 
backport your patch to 2.2.x.

Cheers,
Stefan

Re: Graceful restart not so graceful?

Posted by Jeff Trawick <tr...@gmail.com>.
On Wed, Jan 7, 2009 at 6:03 AM, Joe Orton <jo...@redhat.com> wrote:

> On Tue, Jan 06, 2009 at 12:10:25PM -0600, William Rowe wrote:
> > Would folks comment on Nathan's, Joe's and Stefan's work on
> >
> > https://issues.apache.org/bugzilla/show_bug.cgi?id=42829
> >
> > and offer any comments on why this patch;
> >
> > https://issues.apache.org/bugzilla/attachment.cgi?id=22822
> >
> > should not be applied to trunk/ and branches/2.2.x/ in time for
> > the next releases?
>
> I never found the time to commit this for a bunch of reasons:
>
> 1) I never convinced myself that adding a bunch of complexity to
> prefork, per Stefan's patch, was the right way to address this
>
> 2) I never worked out what impact the lack of check on the poll
> descriptor event type (POLLERR etc) had on this, but that needs
> resolving too
>
> 3) I hadn't checked whether worker et al had similar issues
>
> 4) I think that a simpler solution to this problem would be to add a
> timeout (e.g. 30s) on the child pollset_poll() call so they are
> guaranteed to pop at some point, but I haven't had time to try this or
> work out whether that will suck in some other way.


Initial testing of your idea for a timeout was promising.  I'll try to test
it with graceful stop as well.

Index: server/mpm/prefork/prefork.c
===================================================================
--- server/mpm/prefork/prefork.c    (revision 731724)
+++ server/mpm/prefork/prefork.c    (working copy)
@@ -540,10 +540,12 @@
                 apr_int32_t numdesc;
                 const apr_pollfd_t *pdesc;

-                /* timeout == -1 == wait forever */
-                status = apr_pollset_poll(pollset, -1, &numdesc, &pdesc);
+                /* timeout == 10 seconds to avoid a hang at graceful
restart/stop
+                 * caused by the closing of sockets by the signal handler
+                 */
+                status = apr_pollset_poll(pollset, apr_time_from_sec(10),
&numdesc, &pdesc);
                 if (status != APR_SUCCESS) {
-                    if (APR_STATUS_IS_EINTR(status)) {
+                    if (APR_STATUS_IS_TIMEUP(status) ||
APR_STATUS_IS_EINTR(status)) {
                         if (one_process && shutdown_pending) {
                             return;
                         }

Re: Graceful restart not so graceful?

Posted by Joe Orton <jo...@redhat.com>.
On Tue, Jan 06, 2009 at 12:10:25PM -0600, William Rowe wrote:
> Would folks comment on Nathan's, Joe's and Stefan's work on
> 
> https://issues.apache.org/bugzilla/show_bug.cgi?id=42829
> 
> and offer any comments on why this patch;
> 
> https://issues.apache.org/bugzilla/attachment.cgi?id=22822
> 
> should not be applied to trunk/ and branches/2.2.x/ in time for
> the next releases?

I never found the time to commit this for a bunch of reasons:

1) I never convinced myself that adding a bunch of complexity to 
prefork, per Stefan's patch, was the right way to address this

2) I never worked out what impact the lack of check on the poll 
descriptor event type (POLLERR etc) had on this, but that needs 
resolving too

3) I hadn't checked whether worker et al had similar issues

4) I think that a simpler solution to this problem would be to add a 
timeout (e.g. 30s) on the child pollset_poll() call so they are 
guaranteed to pop at some point, but I haven't had time to try this or 
work out whether that will suck in some other way.

(Also, my stuff to dup2() against the listener was demonstrably useless, 
per the thread Jeff referenced, so that should definitely not be 
committed, regardless.)

Regards, Joe