Posted to dev@httpd.apache.org by Joe Orton <jo...@redhat.com> on 2008/02/05 13:53:00 UTC

Re: PR42829: graceful restart with multiple listeners using prefork MPM can result in hung processes

On Fri, Feb 01, 2008 at 10:41:39AM +0100, Stefan Fritsch wrote:
> Joe Orton wrote:
> > I mentioned in the bug that the signal handler could cause undefined
> > behaviour, but I'm not sure now whether that is true.  On Linux I can
> > reproduce some cases where this will happen, which are all due to
> > well-defined behaviour:
> >
> > 1) with some (default on Linux) accept mutex types,
> > apr_proc_mutex_lock() will loop on EINTR.  Hence, children blocked
> > waiting for the mutex do "hang" until the mutex is released.  Fixing
> > this would need some APR work, new interfaces, blah
> 
> This is not a problem. On graceful-stop or reload the processes will get
> the lock one by one and die (or hang somewhere else). I have never seen a
> left over process hanging in this function.

Well, normally all children will be woken up and take the accept mutex 
because of the dummy connections.  But if you have one child blocked 
because of issue (3) - whilst holding the accept mutex - all the other 
children will also be blocked.  If the EINTR could be processed at MPM 
level, this wouldn't happen.  So I think it is a problem, though you 
could argue that solving (3) also sort of solves (1).

> > I can also reproduce a third case, but I'm not sure about the cause:
> >
> > 3) apr_pollset_poll() is blocking despite the fact that the listening
> > fds are supposedly already closed before entering the syscall.
> 
> This is the main problem in my experience.
...
> On Linux with epoll, the hanging processes just block in
> apr_pollset_poll(), so checking the return value won't do any good.
> 
> Maybe the problem is that (AIUI) poll() returns POLLNVAL if a fd is not
> open, while epoll() does not have something similar. In epoll.c, a comment
> says "APR_POLLNVAL is not handled by epoll". Or should epoll return
> EPOLLHUP in this case?

I did some more research on this: the case is covered in the epoll(7) 
man page - fds are removed from any containing epoll sets on closure.  
So it is well-defined behaviour, and the "hang" is expected; when all 
the listeners are closed, the poll set becomes empty, so the 
apr_pollset_poll() call will sleep forever, or until interrupted by 
signal!
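
To illustrate the epoll behaviour described above, here is a minimal
Linux-only sketch using raw epoll rather than APR. The function name
epoll_wait_after_close() and the use of a pipe in place of a listening
socket are my own assumptions for the demo; nothing here is taken from
prefork or APR. It registers the read end of a pipe, confirms that a
pending byte is reported, then closes the read end and waits again.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo helper (not APR or httpd code).
 * Returns the result of epoll_wait() after the only registered fd has
 * been closed: 0 means "timed out, nothing reported"; -1 means a setup
 * step or epoll_wait() itself failed. */
int epoll_wait_after_close(int timeout_ms)
{
    int fds[2];
    int epfd, n;
    struct epoll_event ev = { 0 }, out;

    assert(timeout_ms >= 0);  /* the demo must never wait forever */

    if (pipe(fds) != 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    ev.events = EPOLLIN;
    ev.data.fd = fds[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) != 0 ||
        write(fds[1], "x", 1) != 1) {
        close(fds[0]);
        close(fds[1]);
        close(epfd);
        return -1;
    }

    /* While the read end is open, the pending byte is reported. */
    n = epoll_wait(epfd, &out, 1, timeout_ms);
    assert(n == 1 && out.data.fd == fds[0]);

    /* Closing the last descriptor for the read end silently removes it
     * from the epoll set, so the byte still sitting in the pipe is no
     * longer reported and the wait simply runs out its timeout. */
    close(fds[0]);
    n = epoll_wait(epfd, &out, 1, timeout_ms);

    close(fds[1]);
    close(epfd);
    return n;
}
```

Calling epoll_wait_after_close(100) should return 0 after roughly 100ms.
With a timeout of -1, as prefork passes to apr_pollset_poll(), the
second wait would have nothing left to wake it except a signal.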

select() and poll() will indeed return POLLNVAL for the closed-fds case, 
and prefork needs to check for that.
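
For contrast, a small sketch of the poll() side of this, again plain
POSIX rather than APR and with a hypothetical helper name of my own
choosing. It closes an fd and then hands the stale number to poll().
(As far as I know select() has no revents field at all and reports the
same condition by failing with EBADF, so this only demonstrates poll().)

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical demo helper (not APR or httpd code).
 * Returns 1 if poll() flags an already-closed fd with POLLNVAL,
 * 0 if it does not, and -1 if the pipe could not be created. */
int closed_fd_yields_pollnval(int timeout_ms)
{
    int fds[2];
    int n;
    struct pollfd pfd;

    assert(timeout_ms >= 0);  /* the demo must never wait forever */

    if (pipe(fds) != 0)
        return -1;

    pfd.fd = fds[0];
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Close the fd before polling, as the signal handler effectively
     * does to the listeners in the scenario discussed here. */
    close(fds[0]);

    /* poll() returns immediately: an fd that is not open counts as a
     * reported descriptor with POLLNVAL set in revents. */
    n = poll(&pfd, 1, timeout_ms);

    close(fds[1]);
    return (n == 1 && (pfd.revents & POLLNVAL)) ? 1 : 0;
}
```

Unlike the epoll case, the caller gets an immediate wakeup here, but
only benefits if it actually inspects revents for POLLNVAL.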

From some brief googling, FreeBSD kqueue appears to have the same 
guarantee.  This PR has some investigation of what happens with Solaris 
ports: http://issues.apache.org/bugzilla/show_bug.cgi?id=42580

For the graceful-stop case, it would be simple enough to just signal any 
dozy children again to wake them up in the wait-for-exit loop, but 
graceful-restart doesn't have that opportunity, so I'm not sure about a 
general solution.  Reducing the poll timeout to some non-infinite time 
would work.
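
A rough sketch of the finite-timeout idea, written against plain poll()
instead of apr_pollset_poll() so that it stands alone. The names
shutdown_requested, wait_for_listener() and idle_child_notices_shutdown()
are hypothetical, as is the 50ms timeout; a real child would set the
flag from its signal handler.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical flag; a real child would set it from a signal handler. */
static volatile sig_atomic_t shutdown_requested = 0;

enum { WAIT_ERROR = -1, WAIT_SHUTDOWN = 0, WAIT_READY = 1 };

/* Wait for activity, but wake up every timeout_ms to re-check the flag
 * instead of sleeping forever. On WAIT_READY the caller still has to
 * inspect revents (it may be POLLNVAL rather than POLLIN). */
int wait_for_listener(struct pollfd *pfds, nfds_t nfds, int timeout_ms)
{
    assert(timeout_ms > 0);  /* an infinite wait is what we are avoiding */

    for (;;) {
        int n = poll(pfds, nfds, timeout_ms);

        if (n > 0)
            return WAIT_READY;
        if (n < 0 && errno != EINTR)
            return WAIT_ERROR;

        /* Timed out or interrupted by a signal: either way this is the
         * opportunity to notice a pending graceful restart or stop. */
        if (shutdown_requested)
            return WAIT_SHUTDOWN;
    }
}

/* Demo: an idle descriptor plus an already-set flag. With an infinite
 * timeout this would block; with a finite one it returns promptly. */
int idle_child_notices_shutdown(void)
{
    int fds[2];
    int rc;
    struct pollfd pfd;

    if (pipe(fds) != 0)
        return WAIT_ERROR;

    pfd.fd = fds[0];
    pfd.events = POLLIN;
    pfd.revents = 0;

    shutdown_requested = 1;  /* pretend the signal handler already ran */
    rc = wait_for_listener(&pfd, 1, 50);

    close(fds[0]);
    close(fds[1]);
    return rc;
}
```

The cost is one spurious wakeup per idle child per timeout period, so
the timeout trades exit latency against idle overhead.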

joe

Re: PR42829: graceful restart with multiple listeners using prefork MPM can result in hung processes

Posted by Jeff Trawick <tr...@gmail.com>.
On Tue, Feb 5, 2008 at 7:53 AM, Joe Orton <jo...@redhat.com> wrote:

> On Fri, Feb 01, 2008 at 10:41:39AM +0100, Stefan Fritsch wrote:
> > Joe Orton wrote:
> > > I mentioned in the bug that the signal handler could cause undefined
> > > behaviour, but I'm not sure now whether that is true.  On Linux I can
> > > reproduce some cases where this will happen, which are all due to
> > > well-defined behaviour:
> > >
> > > 1) with some (default on Linux) accept mutex types,
> > > apr_proc_mutex_lock() will loop on EINTR.  Hence, children blocked
> > > waiting for the mutex do "hang" until the mutex is released.  Fixing
> > > this would need some APR work, new interfaces, blah
> >
> > This is not a problem. On graceful-stop or reload the processes will get
> > the lock one by one and die (or hang somewhere else). I have never seen a
> > left over process hanging in this function.
>
> Well, normally all children will be woken up and take the accept mutex
> because of the dummy connections.  But if you have one child blocked
> because of issue (3) - whilst holding the accept mutex - all the other
> children will also be blocked.  If the EINTR could be processed at MPM
> level, this wouldn't happen.  So I think it is a problem, though you
> could argue that solving (3) also sort of solves (1).
>
> > > I can also reproduce a third case, but I'm not sure about the cause:
> > >
> > > 3) apr_pollset_poll() is blocking despite the fact that the listening
> > > fds are supposedly already closed before entering the syscall.
> >
> > This is the main problem in my experience.
> ...
> > On Linux with epoll, the hanging processes just block in
> > apr_pollset_poll(), so checking the return value won't do any good.
> >
> > Maybe the problem is that (AIUI) poll() returns POLLNVAL if a fd is not
> > open, while epoll() does not have something similar. In epoll.c, a comment
> > says "APR_POLLNVAL is not handled by epoll". Or should epoll return
> > EPOLLHUP in this case?
>
> I did some more research on this: the case is covered in the epoll(7)
> man page - fds are removed from any containing epoll sets on closure.
> So it is well-defined behaviour, and the "hang" is expected; when all
> the listeners are closed, the poll set becomes empty, so the
> apr_pollset_poll() call will sleep forever, or until interrupted by
> signal!
>
> select() and poll() will indeed return POLLNVAL for the closed-fds case,
> and prefork needs to check for that.
>
> From some brief googling, FreeBSD kqueue appears to have the same
> guarantee.  This PR has some investigation of what happens with Solaris
> ports: http://issues.apache.org/bugzilla/show_bug.cgi?id=42580
>
> For the graceful-stop case, it would be simple enough to just signal any
> dozy children again to wake them up in the wait-for-exit loop, but
> graceful-restart doesn't have that opportunity, so I'm not sure about a
> general solution.  Reducing the poll timeout to some non-infinite time
> would work.


This holds up to some very light graceful-restart testing on OpenSolaris
(the same light testing that triggered a hang):

Index: server/mpm/prefork/prefork.c
===================================================================
--- server/mpm/prefork/prefork.c    (revision 731724)
+++ server/mpm/prefork/prefork.c    (working copy)
@@ -540,10 +540,12 @@
                 apr_int32_t numdesc;
                 const apr_pollfd_t *pdesc;

-                /* timeout == -1 == wait forever */
-                status = apr_pollset_poll(pollset, -1, &numdesc, &pdesc);
+                /* timeout == 10 seconds to avoid a hang at graceful restart/stop
+                 * caused by the closing of sockets by the signal handler
+                 */
+                status = apr_pollset_poll(pollset, apr_time_from_sec(10), &numdesc, &pdesc);
                 if (status != APR_SUCCESS) {
-                    if (APR_STATUS_IS_EINTR(status)) {
+                    if (APR_STATUS_IS_TIMEUP(status) || APR_STATUS_IS_EINTR(status)) {
                         if (one_process && shutdown_pending) {
                             return;
                         }