Posted to dev@httpd.apache.org by Jeff Trawick <tr...@gmail.com> on 2009/01/06 21:32:39 UTC

Re: PR42829: graceful restart with multiple listeners using prefork MPM can result in hung processes

On Tue, Feb 5, 2008 at 7:53 AM, Joe Orton <jo...@redhat.com> wrote:

> On Fri, Feb 01, 2008 at 10:41:39AM +0100, Stefan Fritsch wrote:
> > Joe Orton wrote:
> > > I mentioned in the bug that the signal handler could cause undefined
> > > behaviour, but I'm not sure now whether that is true.  On Linux I can
> > > reproduce some cases where this will happen, which are all due to
> > > well-defined behaviour:
> > >
> > > 1) with some (default on Linux) accept mutex types,
> > > apr_proc_mutex_lock() will loop on EINTR.  Hence, children blocked
> > > waiting for the mutex do "hang" until the mutex is released.  Fixing
> > > this would need some APR work, new interfaces, blah
> >
> > This is not a problem. On graceful-stop or reload the processes will get
> > the lock one by one and die (or hang somewhere else). I have never seen a
> > left over process hanging in this function.
>
> Well, normally all children will be woken up and take the accept mutex
> because of the dummy connections.  But if you have one child blocked
> because of issue (3) - whilst holding the accept mutex - all the other
> children will also be blocked.  If the EINTR could be processed at MPM
> level, this wouldn't happen.  So I think it is a problem, though you
> could argue that solving (3) also sort of solves (1).
>
> > > I can also reproduce a third case, but I'm not sure about the cause:
> > >
> > > 3) apr_pollset_poll() is blocking despite the fact that the listening
> > > fds are supposedly already closed before entering the syscall.
> >
> > This is the main problem in my experience.
> ...
> > On Linux with epoll, the hanging processes just block in
> > apr_pollset_poll(), so checking the return value won't do any good.
> >
> > Maybe the problem is that (AIUI) poll() returns POLLNVAL if a fd is not
> > open, while epoll() does not have something similar. In epoll.c, a comment
> > says "APR_POLLNVAL is not handled by epoll". Or should epoll return
> > EPOLLHUP in this case?
>
> I did some more research on this: the case is covered in the epoll(7)
> man page - fds are removed from any containing epoll sets on closure.
> So it is well-defined behaviour, and the "hang" is expected; when all
> the listeners are closed, the poll set becomes empty, so the
> apr_pollset_poll() call will sleep forever, or until interrupted by
> signal!
>
> select() and poll() will indeed return POLLNVAL for the closed-fds case,
> and prefork needs to check for that.
>
> From some brief googling, FreeBSD kqueue appears to have the same
> guarantee.  This PR has some investigation of what happens with Solaris
> ports: http://issues.apache.org/bugzilla/show_bug.cgi?id=42580
>
> For the graceful-stop case, it would be simple enough to just signal any
> dozy children again to wake them up in the wait-for-exit loop, but
> graceful-restart doesn't have that opportunity, so I'm not sure about a
> general solution.  Reducing the poll timeout to some non-infinite time
> would work.
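
The difference described above between poll() and epoll can be reproduced in a few lines of C on Linux. This is only a sketch under stated assumptions: a pipe stands in for the listening socket, the function names are made up for illustration, and a finite timeout is passed so the demonstration returns instead of sleeping forever.

```c
#include <poll.h>
#include <sys/epoll.h>
#include <unistd.h>

/* poll(): a closed descriptor is reported right away as POLLNVAL.
 * Returns the revents mask, or -1 if setup or poll() itself failed. */
int poll_revents_after_close(int timeout_ms)
{
    int fds[2];
    struct pollfd pfd;
    int rc;

    if (pipe(fds) != 0)
        return -1;
    close(fds[0]);                 /* stands in for the closed listener */

    pfd.fd = fds[0];
    pfd.events = POLLIN;
    pfd.revents = 0;
    rc = poll(&pfd, 1, timeout_ms);

    close(fds[1]);
    return rc < 0 ? -1 : pfd.revents;
}

/* epoll: closing the only descriptor for the registered file removes it
 * from the set, so the wait can only end by timeout or signal.
 * Returns the epoll_wait() result (0 == timed out), or -1 on failure. */
int epoll_wait_after_close(int timeout_ms)
{
    int fds[2];
    struct epoll_event ev, out;
    int epfd, rc;

    if (pipe(fds) != 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    ev.events = EPOLLIN;
    ev.data.fd = fds[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) != 0) {
        close(fds[0]);
        close(fds[1]);
        close(epfd);
        return -1;
    }

    close(fds[0]);                 /* the epoll set is now empty */
    rc = epoll_wait(epfd, &out, 1, timeout_ms);

    close(fds[1]);
    close(epfd);
    return rc;
}
```

With a timeout of -1 instead of a finite value, the second function would block until a signal arrives, which matches the hang described in the PR.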


This holds up to some very light graceful-restart testing on OpenSolaris
(the same light testing that triggered a hang):

Index: server/mpm/prefork/prefork.c
===================================================================
--- server/mpm/prefork/prefork.c    (revision 731724)
+++ server/mpm/prefork/prefork.c    (working copy)
@@ -540,10 +540,12 @@
                 apr_int32_t numdesc;
                 const apr_pollfd_t *pdesc;

-                /* timeout == -1 == wait forever */
-                status = apr_pollset_poll(pollset, -1, &numdesc, &pdesc);
+                /* timeout == 10 seconds to avoid a hang at graceful restart/stop
+                 * caused by the closing of sockets by the signal handler
+                 */
+                status = apr_pollset_poll(pollset, apr_time_from_sec(10), &numdesc, &pdesc);
                 if (status != APR_SUCCESS) {
-                    if (APR_STATUS_IS_EINTR(status)) {
+                    if (APR_STATUS_IS_TIMEUP(status) ||
APR_STATUS_IS_EINTR(status)) {
                         if (one_process && shutdown_pending) {
                             return;
                         }
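
The idea behind the patch is a bounded wait that treats a timeout like EINTR and re-checks the shutdown state. It can be sketched without APR using plain poll(). This is an illustration only, not the prefork code: wait_for_event() and stop_requested are hypothetical names, and a real server would set the flag from its signal handler.

```c
#include <errno.h>
#include <poll.h>
#include <signal.h>

/* Hypothetical flag; a real server would set it from a signal handler. */
volatile sig_atomic_t stop_requested = 0;

/* Wait for an event on fd, waking up every timeout_ms to re-check the flag.
 * Returns 1 when poll() reports an event on fd, 0 when a stop was requested,
 * and -1 on a poll() error or when fd is no longer open (POLLNVAL). */
int wait_for_event(int fd, int timeout_ms)
{
    struct pollfd pfd;

    for (;;) {
        int rc;

        pfd.fd = fd;
        pfd.events = POLLIN;
        pfd.revents = 0;
        rc = poll(&pfd, 1, timeout_ms);

        /* Timeout and EINTR take the same path: look at the flag, then
         * either give up or go back to waiting. */
        if (rc == 0 || (rc < 0 && errno == EINTR)) {
            if (stop_requested)
                return 0;
            continue;
        }
        if (rc < 0 || (pfd.revents & POLLNVAL))
            return -1;
        return 1;
    }
}
```

The cost of this approach is one spurious wakeup per idle process per timeout interval, so a longer timeout trades shutdown latency for fewer wakeups.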