Posted to bugs@httpd.apache.org by bu...@apache.org on 2009/08/05 12:12:33 UTC

DO NOT REPLY [Bug 47645] New: httpd-2.2.12 often hangs for hours

https://issues.apache.org/bugzilla/show_bug.cgi?id=47645

           Summary: httpd-2.2.12 often hangs for hours
           Product: Apache httpd-2
           Version: 2.2.12
          Platform: Sun
        OS/Version: Solaris
            Status: NEW
          Severity: critical
          Priority: P2
         Component: All
        AssignedTo: bugs@httpd.apache.org
        ReportedBy: strube@physik3.gwdg.de


--- Comment #0 from HWS <st...@physik3.gwdg.de> 2009-08-05 03:12:31 PDT ---
After upgrading from httpd 2.2.11 to 2.2.12, the server was blocked at least
once per day for up to six hours. In this state, neither any answer packets
were generated (tested with network sniffer) nor any log lines written. In one
such state, I restarted the server gracefully, then it worked again. The log
files do not show any errors. Tracing the running master httpd process with
Solaris "truss" showed a loop of pollsys(...) and waitid(P_ALL,...) calls, but
this is normal. Only downgrading to 2.2.11 helped.

Replacing the log-pipe syntax "|exec cronolog ..." with the new one,
"||cronolog ..." had no effect. Systematic testing will be difficult since I
cannot do this on our "production" web server and the bug cannot be triggered
at will.

System: Solaris 10 u7 (SPARC), httpd self-compiled with Sun Studio Forte 7
compiler, configure options: CFLAGS='-xO3 -xarch=v8 -xstrconst'
--disable-threads --enable-suexec --with-ssl=/usr/sfw (plus module options;
mpm=prefork [default]). Additionally, mod_fastcgi was compiled separately. All
this and the runtime configuration were identical to previous versions, which
always worked reliably.

Process hierarchy (log output is piped through cronolog):
httpd        (master)
  cronolog
  cronolog
  cronolog
  httpd      (fastcgi controller)
    php-fcgi (master, PHP_FCGI_CHILDREN=2)
      php-fcgi
      php-fcgi
  httpd
  httpd
  ...
  httpd

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@httpd.apache.org
For additional commands, e-mail: bugs-help@httpd.apache.org


DO NOT REPLY [Bug 47645] httpd-2.2.12 often hangs for hours

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=47645



--- Comment #9 from HWS <st...@physik3.gwdg.de> 2009-08-10 07:44:54 PDT ---
>You will likely be able to work around the problem by selecting a different
>poll implementation.

For testing, I compiled two new variants of libapr-1.so.0.3.7. First, I removed
the line "#define HAVE_PORT_CREATE 1" from
srclib/apr/include/arch/unix/apr_private.h, causing poll() instead of
port_create() to be used (I verified that no reference to port_create()
remained in the library). With this library, no hangs happened in 3.5 days.
Second, I kept port_create() but removed the two occurrences of fcntl(fd,
F_SETFD, flags) from srclib/apr/poll/unix/port.c, making it behave as in
httpd-2.2.11. In spite of this, hangs soon occurred again.

Thus, NOT using port_create() seems to be a workaround. Still, I am not
content with this, because 2.2.11 also used port_create() without ANY
difficulties! (This is why I also tested the effect of F_SETFD in port.c, which
was new since 2.2.11. But there is F_SETFD in many other places, too.) The
actual cause of the hangs is, IMHO, still unknown (some subtle,
machine-dependent timing problem?).



DO NOT REPLY [Bug 47645] httpd-2.2.12 often hangs for hours

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=47645



--- Comment #6 from Jeff Trawick <tr...@apache.org> 2009-08-06 05:44:18 PDT ---
BTW, what exactly did you do to clear up the hang?



DO NOT REPLY [Bug 47645] httpd-2.2.12 often hangs for hours

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=47645



--- Comment #3 from HWS <st...@physik3.gwdg.de> 2009-08-06 05:07:05 PDT ---
Sooner than expected, the bug occurred again, and I was able to collect some
hopefully informative output while the server was hanging:

*** master httpd process:
* pstack PID :
 fefcc1d0 pollsys  (ffbffb30, 0, ffbffb98, 0)
 fef67b68 pselect  (ffbffb30, ff034630, ff034630, 0, ffbffb98, 0) + 1c8
 fef67ee0 select   (0, 0, 0, 0, ffbffc00, 4ab82600) + a0
 ff21faf0 apr_sleep (0, f4240, f4240, 33c, 9c180, f4000) + 4c
 0004398c ap_wait_or_timeout (11175, 11000, ffbffcd0, ffffffff, 3, f4000) + 7c
 0006ab00 ap_mpm_run (a, ff220c9c, 0, 0, 9ad84, 9c030) + 2f0
 0002787c main     (1, 9957c, 99400, 9a400, 99510, cb188) + 994
 00026688 _start   (0, 0, 0, 0, 0, 0) + 108

* truss -p PID :
pollsys(0xFFBFFB30, 0, 0xFFBFFB98, 0x00000000)  = 0
waitid(P_ALL, 0, 0xFFBFFAB8, WEXITED|WTRAPPED|WSTOPPED|WNOHANG) = 0
pollsys(0xFFBFFB30, 0, 0xFFBFFB98, 0x00000000) (sleeping...)
pollsys(0xFFBFFB30, 0, 0xFFBFFB98, 0x00000000)  = 0
waitid(P_ALL, 0, 0xFFBFFAB8, WEXITED|WTRAPPED|WSTOPPED|WNOHANG) = 0
pollsys(0xFFBFFB30, 0, 0xFFBFFB98, 0x00000000)  = 0
waitid(P_ALL, 0, 0xFFBFFAB8, WEXITED|WTRAPPED|WSTOPPED|WNOHANG) = 0
... etc.

* truss -t !all -u a.out: -p PID :
  (No other process showed any output on this!)
/1@1:       -> perform_idle_server_maintenance(0xa6e18, 0x11000, 0xffbffcd0, 0xffffffff)
/1@1:       <- ap_wait_or_timeout() = 0xffbffce4
/1@1:       -> ap_wait_or_timeout(0xffbffce4, 0xffbffce0, 0xffbffcd0, 0xa6e18)
/1@1:       <- ap_wait_or_timeout() = 0x11175
/1@1:       -> perform_idle_server_maintenance(0xa6e18, 0x11000, 0xffbffcd0, 0xffffffff)
/1@1:       <- perform_idle_server_maintenance() = 1
/1@1:       -> ap_wait_or_timeout(0xffbffce4, 0xffbffce0, 0xffbffcd0, 0xa6e18)
/1@1:         -> ap_run_monitor(0xa6e18, 0x0, 0x0, 0x0)
/1@1:         <- ap_run_monitor() = 0
/1@1:       <- ap_wait_or_timeout() = 0x11175
/1@1:       -> perform_idle_server_maintenance(0xa6e18, 0x11000, 0xffbffcd0, 0xffffffff)
/1@1:       <- perform_idle_server_maintenance() = 1
/1@1:       -> ap_wait_or_timeout(0xffbffce4, 0xffbffce0, 0xffbffcd0, 0xa6e18)
/1@1:       <- ap_wait_or_timeout() = 0x11175
/1@1:       -> perform_idle_server_maintenance(0xa6e18, 0x11000, 0xffbffcd0, 0xffffffff)
... etc.

*** fcgi-controlling httpd process:
* pstack PID :
 fefcc1d0 pollsys  (ffbff9b0, 1, 0, 0)
 fef67b68 pselect  (ffbff9b0, ff034630, ff034630, 40, 0, 0) + 1c8
 fef67ee0 select   (10, ffbffba0, 0, 0, 0, ffbffc29) + a0
 feb99a9c fcgi_pm_main (180, 0, febb6b80, feba005c, feba00e8, 3) + 9a4
 feb943d4 init_module (a6e18, 18f6a8, aae28, ada58, 9782c, 0) + 264
 0003abd4 ap_run_post_config (a6e18, e4f10, aae28, ada58, 9, 1c6908) + 48
 00027820 main     (1, 9957c, 99400, 9a400, 99510, cb188) + 938
 00026688 _start   (0, 0, 0, 0, 0, 0) + 108

* truss -p PID :
pollsys(0xFFBFF9B0, 1, 0x00000000, 0x00000000) (sleeping...)

*** one worker httpd process:
* pstack PID :
 fefcb904 portfs   (6, 8, 1ed548, 2, 1, ffbffa80)
 ff21a994 apr_pollset_poll (1ed508, 0, 989680, ffbffaf8, ffbffaf4, 11177) + 12c
 0006a218 child_main (18, 989400, 9a400, 1f34d0, 1, 9c400) + 378
 0006a4d8 make_child (9a400, d, 9ac00, 0, 9ac00, e) + 128
 0006a6f0 perform_idle_server_maintenance (2, ffbffbec, 9ad7c, 1, 2, 9c400) + 164
 0006ac18 ap_mpm_run (a, ff220c9c, 0, 0, 9ad84, 9c030) + 408
 0002787c main     (1, 9957c, 99400, 9a400, 99510, cb188) + 994
 00026688 _start   (0, 0, 0, 0, 0, 0) + 108

* truss -p PID :
port_getn(8, 0x001ED548, 2, 1, 0xFFBFFA80) (sleeping...)
port_getn(8, 0x001ED548, 2, 1, 0xFFBFFA80)      = 0 [62]
port_getn(8, 0x001ED548, 2, 1, 0xFFBFFA80) (sleeping...)

*** all other worker httpd processes (addresses varying):
* pstack PID :
 fefcb804 fcntl    (f, 7, ff232d2c)
 fefbb4cc fcntl    (f, 1, ff332a00, ff213f24, ffffedf6, 49db8400) + 84
 ff212610 proc_mutex_fcntl_acquire (33c, 0, 1f3f8, fffb7d40, ff2319d4, 18f7f0) + 44
 000698cc accept_mutex_on (8, 0, 0, 0, 8, 9c000) + 8
 0006a168 child_main (18, 3e8, 9a400, 1f34d0, 1, 9c400) + 2c8
 0006a4d8 make_child (9a400, c, 9ac00, 0, 9ac00, d) + 128
 0006a6f0 perform_idle_server_maintenance (2, ffbffbe8, 9ad7c, 0, 2, 9c400) + 164
 0006ac18 ap_mpm_run (a, ff220c9c, 0, 0, 9ad84, 9c030) + 408
 0002787c main     (1, 9957c, 99400, 9a400, 99510, cb188) + 994
 00026688 _start   (0, 0, 0, 0, 0, 0) + 108

* truss -p PID :
fcntl(15, F_SETLKW, 0xFF232D2C) (sleeping...)

*** all cronolog processes (addresses varying):
* pstack PID :
 ff2cc2e4 read     (0, ffbdfd5c, 10000)
 00011c74 main     (26c00, 4a7aac78, 25dd0, 3, 25800, ffbffe8e) + 790
 000113dc _start   (0, 0, 0, 0, 0, 0) + 108

* truss -p PID :
read(0, 0xFFBDFD5C, 65536)    (sleeping...)

*** master php process:
* pstack PID :
 feb4c91c waitid   (7, 0, ffbfdbd0, 3)
 feb3c120 wait     (ffbffe38, 1, ff2c2a00, 2, 570cf4, 0) + 58
 0047182c main     (7fa0, 580f20, 0, 1, 58103c, 1) + 9cc
 00042fa8 _start   (0, 0, 0, 0, 0, 0) + 108

* truss -p PID :
waitid(P_ALL, 0, 0xFFBFDBD0, WEXITED|WTRAPPED) (sleeping...)

*** child php processes:
* pstack PID :
 feb4b97c accept   (0, ffbfdc3c, ffbfdcac, 1)
 00472794 main     (8014, 0, 1f4, 1, 0, 1) + 1934
 00042fa8 _start   (0, 0, 0, 0, 0, 0) + 108

* truss -p PID :
accept(0, 0xFFBFDC3C, 0xFFBFDCAC, SOV_DEFAULT) (sleeping...)
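
As an illustration of the traces above: the workers sleeping in
fcntl(15, F_SETLKW, ...) are queued on the prefork accept mutex
(proc_mutex_fcntl_acquire), while the single worker in port_getn() presumably
holds it. The following is a minimal sketch in C, not httpd or APR code, of how
an fcntl() record lock held by one process keeps another process out. The
function names and the temporary file path are hypothetical, and the sketch
uses the non-blocking F_SETLK instead of F_SETLKW so that it terminates.

```c
#define _XOPEN_SOURCE 700   /* for mkstemp() under strict compiler modes */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Try to take a whole-file write lock without blocking.
 * Returns 0 on success, or the errno value on failure. */
static int try_lock(int fd)
{
    struct flock fl = {0};

    assert(fd >= 0);
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "up to end of file" */
    return fcntl(fd, F_SETLK, &fl) == -1 ? errno : 0;
}

/* Hypothetical demo: the parent plays the worker that owns the accept
 * mutex, a forked child plays one of the waiting workers.
 * Returns 1 if the child found the lock busy, 0 if the child could take
 * it, and -1 on a setup error. */
int child_sees_lock_busy(void)
{
    char path[] = "/tmp/accept_lock_XXXXXX";   /* illustrative path */
    int fd = mkstemp(path);
    int status = 0;
    pid_t pid;

    if (fd == -1)
        return -1;
    unlink(path);               /* the open descriptor keeps the file alive */

    if (try_lock(fd) != 0) {    /* parent takes the lock first */
        close(fd);
        return -1;
    }

    pid = fork();
    if (pid == -1) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        /* fcntl() locks belong to a process and are not inherited
         * across fork(), so the child has to compete for the lock. */
        int err = try_lock(fd);
        _exit((err == EAGAIN || err == EACCES) ? 1 : 0);
    }

    if (waitpid(pid, &status, 0) == -1) {
        close(fd);
        return -1;
    }
    close(fd);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Called from a main(), child_sees_lock_busy() should return 1 on a POSIX system.
With F_SETLKW the child would sleep until the lock is released, which is what
the truss output of the waiting workers shows.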



DO NOT REPLY [Bug 47645] httpd-2.2.12 often hangs for hours

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=47645



--- Comment #1 from Jeff Trawick <tr...@apache.org> 2009-08-05 06:24:25 PDT ---
Please use pstack to see what the httpd child processes are doing and post
unique backtraces here.



DO NOT REPLY [Bug 47645] httpd-2.2.12 often hangs for hours

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=47645


Jeff Trawick <tr...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO


--- Comment #4 from Jeff Trawick <tr...@apache.org> 2009-08-06 05:17:54 PDT ---
>Please grant me some time (up to a week) for my next information.

No problem.  This is best effort/as-time-allows support on both sides ;)  I'll
just mark the status as NEEDINFO, which you can change when you're able to
duplicate the problem and post the backtraces.



DO NOT REPLY [Bug 47645] httpd-2.2.12 often hangs for hours

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=47645


Jeff Trawick <tr...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|All                         |APR
            Version|2.2.12                      |1.3.4
         AssignedTo|bugs@httpd.apache.org       |bugs@apr.apache.org
            Product|Apache httpd-2              |APR


--- Comment #11 from Jeff Trawick <tr...@apache.org> 2009-08-19 11:33:44 PDT ---
I was able to reproduce the event port problem and then run successfully with
this patch, which applies cleanly to the APR in httpd 2.2.12 or 2.2.13.  (It
isn't expected to be needed for httpd older than 2.2.12, unless a certain
prefork MPM fix has been applied on top of 2.2.11.)

----------cut here-----------
--- poll/unix/port.c.orig    2009-08-19 13:48:22.876217458 -0400
+++ poll/unix/port.c    2009-08-19 13:49:49.908294412 -0400
@@ -320,6 +320,17 @@
     ret = port_getn(pollset->port_fd, pollset->port_set, pollset->nalloc,
                     &nget, tvptr);

+    if (ret == -1 && nget > 0) {
+        /* On at least some levels of Solaris, -1/ETIME can be
+         * reported at the same time an event is returned.
+         * Since the kernel forgets the event condition as soon as
+         * it is returned, we have to ignore the -1/ETIME and
+         * process the event to prevent a deadlock.
+         */
+
+        ret = 0;
+    }
+
     /* decrease the waiting ASAP to reduce the window for calling 
        port_associate within apr_pollset_add() */
     apr_atomic_dec32(&pollset->waiting);
------------cut here-----------

If anyone tries this patch, please report back.
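
The decision the patch adds can be isolated into a small, portable sketch.
This is not the APR code: struct getn_result, enum wait_outcome and
classify_getn() are hypothetical names that only model the three values
port_getn() hands back (return code, errno, number of events stored), so that
the rule can be read and tested without a Solaris system.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of what a port_getn() call reports back. */
struct getn_result {
    int ret;            /* return value: 0 or -1 */
    int err;            /* errno, meaningful when ret == -1 */
    unsigned int nget;  /* number of events stored by the call */
};

enum wait_outcome { WAIT_EVENTS, WAIT_TIMEOUT, WAIT_ERROR };

/* Classify a result the way the patched code treats it. */
enum wait_outcome classify_getn(struct getn_result r)
{
    assert(r.ret == 0 || r.ret == -1);

    /* Same rule as the patch: if events were returned, process them
     * even though the call also reported -1 (for example -1/ETIME).
     * Dropping them would lose the event for good. */
    if (r.ret == -1 && r.nget > 0)
        r.ret = 0;

    if (r.ret == 0)
        return WAIT_EVENTS;
    return r.err == ETIME ? WAIT_TIMEOUT : WAIT_ERROR;
}
```

With this rule, a result of {-1, ETIME, 1} is handled as an event rather than
as a timeout, while {-1, ETIME, 0} remains an ordinary timeout.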



DO NOT REPLY [Bug 47645] httpd-2.2.12 often hangs for hours

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=47645



--- Comment #7 from HWS <st...@physik3.gwdg.de> 2009-08-06 06:43:46 PDT ---
Before recompiling and doing new tests (which I have to delay till after the
next hang), here are some remarks:

> I had a similar situation
>with the event MPM once and this same implementation of apr_pollset_poll(),
>but I didn't have time to debug or to narrow it down to one of {Event MPM,
>APR, Solaris kernel}.
>You will likely be able to work around the problem by selecting a different
>poll implementation.

As far as I can see, 2.2.11 used exactly the same implementation, without any
difficulties. The only change now is setting FD_CLOEXEC.

>The TCP layer should answer a SYN packet.  Did that happen?
>What does netstat show at the time of the hang?  If connections are able to
>reach the ESTABLISHED state but not responded to, then TCP is doing the right
>thing and httpd isn't finding the connection.  If none reach the ESTABLISHED
>state, then it is a TCP layer issue.

I have to wait for the next hang, but as far as I remember, snoop did not show
any 'A' packets as reply to 'S' packets. If I remember the netstat output
correctly, old connections were visible, but no new ones.

>BTW, what exactly did you do to clear up the hang?

Often nothing, but then it took hours to recover. Graceful restart always
cleared the hang.



DO NOT REPLY [Bug 47645] httpd-2.2.12 often hangs for hours

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=47645



--- Comment #8 from HWS <st...@physik3.gwdg.de> 2009-08-06 08:16:24 PDT ---
>I have to wait for the next hang, but as far as I remember, snoop did not show
>any 'A' packets as reply to 'S' packets. If I remember the netstat output
>correctly, old connections were visible, but no new ones.

No, my memory seems to be bad! A new hang has now occurred, and new connection
attempts do complete the handshake and become ESTABLISHED. netstat also shows
many connections in CLOSE_WAIT state, which vanished after a graceful restart
(done in order to revive the server).



DO NOT REPLY [Bug 47645] httpd-2.2.12 often hangs for hours

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=47645



--- Comment #2 from HWS <st...@physik3.gwdg.de> 2009-08-06 00:37:31 PDT ---
I used a duplicate installation on another machine (same OS) for testing and a
script executing wget queries to six web pages (with subqueries to css, images
etc.) at a high rate (about 100 times the average load of our web server). The
bug did not occur within 16 hours! However, the two machines (both SPARC) have
different CPU types and CPU counts. Thus I will have to reinstall 2.2.12 on our
web server and wait until it hangs (although I don't like this).
Please grant me some time (up to a week) for my next information.



DO NOT REPLY [Bug 47645] httpd-2.2.12 often hangs for hours

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=47645


Jeff Trawick <tr...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW


--- Comment #5 from Jeff Trawick <tr...@apache.org> 2009-08-06 05:39:11 PDT ---
This pid is supposed to wake up for the next client connection:

* pstack PID :
 fefcb904 portfs   (6, 8, 1ed548, 2, 1, ffbffa80)
 ff21a994 apr_pollset_poll (1ed508, 0, 989680, ffbffaf8, ffbffaf4, 11177) + 12c
 0006a218 child_main (18, 989400, 9a400, 1f34d0, 1, 9c400) + 378

From the truss I guess it is waking up every 10 seconds or so (whatever the
prefork timeout is), but never finding a new connection.

I don't know exactly what is causing the problem.  I had a similar situation
with the event MPM once and this same implementation of apr_pollset_poll(), but
I didn't have time to debug or to narrow it down to one of {Event MPM, APR,
Solaris kernel}.

You will likely be able to work around the problem by selecting a different
poll implementation.  It should be as easy as 

$ make distclean
$ export ac_cv_func_port_create=no
$ ./configure [normal configure options]

(Does that look right to others following the bug list?)

From another standpoint:

>In this state, neither any answer packets were generated (tested with network sniffer) nor any log lines written.

The TCP layer should answer a SYN packet.  Did that happen?

What does netstat show at the time of the hang?  If connections are able to
reach the ESTABLISHED state but not responded to, then TCP is doing the right
thing and httpd isn't finding the connection.  If none reach the ESTABLISHED
state, then it is a TCP layer issue.
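
The first case can be demonstrated with a short sketch in C: the kernel
completes the TCP handshake for a listening socket on its own, so a client
connect() succeeds even if the server process never calls accept(). The
function name is hypothetical, and the sketch assumes a working loopback
interface and an ordinary POSIX sockets API.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if connect() succeeds against a listening socket on which
 * accept() is never called, 0 if the connection fails, and -1 on a
 * setup error. */
int connect_succeeds_without_accept(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;
    int lfd, cfd, ok;

    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd == -1)
        return -1;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                  /* let the kernel pick a free port */

    if (bind(lfd, (struct sockaddr *)&addr, sizeof addr) == -1 ||
        listen(lfd, 8) == -1 ||
        getsockname(lfd, (struct sockaddr *)&addr, &len) == -1) {
        close(lfd);
        return -1;
    }
    assert(len == sizeof addr);

    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd == -1) {
        close(lfd);
        return -1;
    }

    /* The server side deliberately never calls accept(): the connection
     * is established by the TCP layer and waits in the listen queue. */
    ok = connect(cfd, (struct sockaddr *)&addr, sizeof addr) == 0;

    close(cfd);
    close(lfd);
    return ok;
}
```

A connection made this way shows up as ESTABLISHED in netstat although no
server process has picked it up, which is the symptom to look for during a
hang.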

What changed in 2.2.12?

The Solaris port code changed in APR, but it doesn't look dangerous:

+    {
+        int flags;
+
+        if ((flags = fcntl((*pollset)->port_fd, F_GETFD)) == -1)
+            return errno;
+
+        flags |= FD_CLOEXEC;
+        if (fcntl((*pollset)->port_fd, F_SETFD, flags) == -1)
+            return errno;
+    }
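
For readers unfamiliar with the pattern in this excerpt, here is a standalone
sketch in C of the same F_GETFD/F_SETFD sequence applied to an arbitrary
descriptor. set_cloexec() and has_cloexec() are hypothetical helper names, not
APR functions. FD_CLOEXEC is a per-descriptor flag that only takes effect when
the process calls one of the exec functions, so it would not be expected to
change how the descriptor behaves while polling.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Mark fd close-on-exec. Returns 0 on success or the errno value,
 * mirroring the error handling in the excerpt. */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);     /* read the descriptor flags */

    if (flags == -1)
        return errno;

    flags |= FD_CLOEXEC;                /* keep any other flags intact */
    if (fcntl(fd, F_SETFD, flags) == -1)
        return errno;
    return 0;
}

/* Returns 1 if fd is valid and has FD_CLOEXEC set, otherwise 0. */
int has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    return flags != -1 && (flags & FD_CLOEXEC) != 0;
}
```

The flag is stored per descriptor, so setting it on one end of a pipe leaves
the other end unchanged.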



DO NOT REPLY [Bug 47645] httpd-2.2.12 often hangs for hours

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=47645



--- Comment #10 from Jeff Trawick <tr...@apache.org> 2009-08-11 12:29:11 PDT ---
Thanks so much for the valuable research.

I agree that disabling port_create() is not a satisfying work-around; there's
no question that something needs to be fixed, but I don't know where (kernel,
APR, httpd, or even compiler).  It's more than a little astonishing that you're
encountering this with the simplest MPM, with APR/httpd threading support
disabled.

I guess that the problem has been there all along but is triggered by some
seemingly unrelated change in 2.2.12.
