You are viewing a plain text version of this content. The canonical link for it is here.
Posted to bugs@apr.apache.org by bu...@apache.org on 2009/10/20 19:46:41 UTC

DO NOT REPLY [Bug 48029] New: httpd-2.2.14 hangs in port_getn

https://issues.apache.org/bugzilla/show_bug.cgi?id=48029

           Summary: httpd-2.2.14 hangs in port_getn
           Product: APR
           Version: 1.3.9
          Platform: Sun
        OS/Version: Solaris
            Status: NEW
          Severity: normal
          Priority: P2
         Component: APR
        AssignedTo: bugs@apr.apache.org
        ReportedBy: httpd.apache@jan-o-sch.net


running httpd on Solaris 10U7, we were facing problems somewhat like the ones
described in bug 47645, at least we saw processes hanging a *long* time in
port_getn once in a while. Although the conversation in that bug looks like
2.2.11 would not be affected, we decided to upgrade - and things are getting
better.

Better meaning the bug is rather easy to reproduce, now. We are using prefork
mpm. What I can reproduce is:
- start a fresh apache instance
- query some cgi

Results:
Some output from the cgi is received slowly, then it suddenly stops. The cgi
seems to have run properly: On the server, the corresponding process has
terminated, and can be seen as "defunct" with ps. Its parent is an httpd
process. pstack on that process shows:
 feef4ce5 portfs   (6, c, 8672428, 2, 1, 8045438)
 fecb0a77 call_port_getn (c, 8672428, 2, 804547c, 11e1a300, 0) + a5
 fecb1335 apr_pollset_poll (86723e8, 11e1a300, 0, 80454d8, 80454dc, 8672800) +
1d5
 081fb58d cgi_bucket_read (8669f70, 8045538, 804553c, 0, 8045520, fefd25eb) +
76
 080b0f86 ap_content_length_filter (866cca8, 8672798, 866bf14, 860c190,
866bf0c, 8669ec0) + bc
[...]

Using solaris' dtrace, we are seeing that the process called port_getn several
times earlier. System calls in chronological order, truncated after
ap_content_length_filter:

[1]
              libc.so.1`_portfs+0x15
              libapr-1.so.0.3.9`apr_pollset_poll+0xe2
              httpd`cgi_bucket_read+0x76
              httpd`ap_content_length_filter+0xbc
              httpd`ap_pass_brigade+0x92
return value: 0

[2]
              libc.so.1`_portfs+0x15
              libapr-1.so.0.3.9`call_port_getn+0xa5
              libapr-1.so.0.3.9`apr_pollset_poll+0x1d5
              httpd`cgi_bucket_read+0x76
              httpd`ap_content_length_filter+0xbc
              httpd`ap_pass_brigade+0x92
return value: 1

[3]
              libc.so.1`_write+0x15
              libapr-1.so.0.3.9`apr_file_write+0x1cc
              libapr-1.so.0.3.9`apr_file_write_full+0x38
              libapr-1.so.0.3.9`apr_file_puts+0x3b
              httpd`log_error_core+0x566
              httpd`ap_log_rerror+0x5b
              httpd`cgi_bucket_read+0x12f
              httpd`ap_content_length_filter+0xbc
              httpd`ap_pass_brigade+0x92
return value: 131
corresponding log entry: (11)Resource temporarily unavailable: poll failed
waiting for CGI child

[4]
              libc.so.1`_writev+0x15
              libapr-1.so.0.3.9`apr_socket_sendv+0x83
              httpd`writev_it_all+0x5d
              httpd`ap_core_output_filter+0xb1a
              httpd`ap_pass_brigade+0x92
return value: 8183
obviously transfers the bit of output that is returned to the client

[5]
              libc.so.1`_portfs+0x15
              libapr-1.so.0.3.9`call_port_getn+0xa5
              libapr-1.so.0.3.9`apr_pollset_poll+0x1d5
              httpd`cgi_bucket_read+0x76
              httpd`ap_content_length_filter+0xbc
              httpd`ap_pass_brigade+0x92
return value: 1

[6]
              libc.so.1`_read+0x15
              libapr-1.so.0.3.9`apr_file_read+0xff
              libapr-1.so.0.3.9`apr_file_gets+0x178
              httpd`log_script_err+0x26
              httpd`cgi_bucket_read+0x1a4
              httpd`ap_content_length_filter+0xbc
              httpd`ap_pass_brigade+0x92
return value: 0
why 0? where has the rest of my cgi output gone?

[7]
              libc.so.1`_portfs+0x15
              libapr-1.so.0.3.9`apr_pollset_remove+0x7b
              httpd`cgi_bucket_read+0x1c4
              httpd`ap_content_length_filter+0xbc
              httpd`ap_pass_brigade+0x92
return value: 0

[8]
              libc.so.1`_portfs+0x15
              libapr-1.so.0.3.9`call_port_getn+0xa5
              libapr-1.so.0.3.9`apr_pollset_poll+0x1d5
              httpd`cgi_bucket_read+0x76
              httpd`ap_content_length_filter+0xbc
              httpd`ap_pass_brigade+0x92
-- hangs for quite a long time

This might also be a mod_cgi bug. However, mod_cgi.c has a timestamp of Jul
2007, and it worked with apr from httpd 2.2.11.

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@apr.apache.org
For additional commands, e-mail: bugs-help@apr.apache.org


DO NOT REPLY [Bug 48029] httpd-2.2.14 hangs in port_getn

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48029

--- Comment #6 from Jeff Trawick <tr...@apache.org> 2009-10-22 04:52:44 UTC ---
Hi Jan,

I don't have any issues patching...  What sort of errors are you getting?  Do
you have GNU patch (gpatch) available?

$ tar -xjf ../apr-1.3.9.tar.bz2 
$ cd apr-1.3.9/
$ patch -p0 --dry-run < ~/PR48029-1.txt 
patching file poll/unix/port.c
$ patch -p0  < ~/PR48029-1.txt 
patching file poll/unix/port.c

(I confirmed that what I can download from Bugzilla matches my original patch
file.)

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@apr.apache.org
For additional commands, e-mail: bugs-help@apr.apache.org


DO NOT REPLY [Bug 48029] httpd-2.2.14 hangs in port_getn

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48029

Markus Schiegl <ap...@schiegl.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |apache@schiegl.com

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@apr.apache.org
For additional commands, e-mail: bugs-help@apr.apache.org


DO NOT REPLY [Bug 48029] httpd-2.2.14 hangs in port_getn

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48029

Bernd Leibing <be...@uni-ulm.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bernd.leibing@uni-ulm.de

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@apr.apache.org
For additional commands, e-mail: bugs-help@apr.apache.org


DO NOT REPLY [Bug 48029] httpd-2.2.14 hangs in port_getn

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48029

--- Comment #9 from Jan Schmidt <ht...@jan-o-sch.net> 2009-10-22 06:39:39 UTC ---
Alright. Tried the patch now - but no differences.

We could write a more elaborate dtrace script, or try to investigate the
process waiting for port_getn() to complete with a debugger. But what are we
looking for?

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@apr.apache.org
For additional commands, e-mail: bugs-help@apr.apache.org


DO NOT REPLY [Bug 48029] httpd-2.2.14 hangs in port_getn

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48029

--- Comment #13 from Jeff Trawick <tr...@apache.org> 2009-10-24 04:49:25 UTC ---
Summary:
port_getn() returns bogus return codes with 32-bit applications on Solaris
10/x86.  APR 1.3.8 accidently managed to work in most cases; APR 1.3.9 treats
these bugs returns from port_getn() as errors since the value is negative.  The
latest Solaris 10 update ("Solaris 10 10/09" a.k.a. "Solaris 10 U8") has the
problem, as do some (all?) earlier releases.  Current OpenSolaris levels do not
have the problem.  It does not apparently affect 64-bit applications on x86 or
any applications on SPARC.  One interesting aspect of the port_getn() problem
is that the retcode is good if you control the process calling port_getn() (at
least with dbx or truss).

Anticipated fix:
Change APR configure to run a small test program and disable the use of Event
Ports if port_getn() returns a bogus value.

Workaround:
Set the environment variable ac_cv_func_port_create=no when building APR in an
affected configuration.

$ make distclean
$ ac_cv_func_port_create=no
$ export ac_cv_func_port_create
$ ./configure --regular-options

Here is a simple test program showing the issue:

$ cat useport.c 
#include <stdio.h>

#include <port.h>

int main(void)
{
    int port, rc;
    uint_t nget;

    port = port_create();

    rc = port_getn(port, NULL, 0, &nget, NULL);
    printf("rc %d nget %u\n", rc, nget);

    return 0;
}
$ cc useport.c && ./a.out 
rc -17349963 nget 0
$ cc -m64 useport.c && ./a.out 
rc 0 nget 0

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@apr.apache.org
For additional commands, e-mail: bugs-help@apr.apache.org


DO NOT REPLY [Bug 48029] httpd-2.2.14 hangs in port_getn

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48029

--- Comment #3 from Jeff Trawick <tr...@apache.org> 2009-10-21 12:22:20 UTC ---
>Thus I guess port_getn is the high level name for _portfs.
>If that is correct, then your explanation with 11 and EAGAIN does not add up,
>as dtrace shows port_getn really returned 1.

If you look for _portfs at
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/gen/event_port.c,
portfs is a generic port-related call (probably kernel entry point), and the
first parameter indicates which operation (which libc call).

I agree though that the source of the EAGAIN isn't necessarily what I said in
my previous update.  I think it is clear that apr_pollset_poll() returned
EAGAIN though, and that means either port_associate() or port_getn() returned
EAGAIN.  port_associate() even documents EAGAIN as a possible return.

The call to _portfs() directly from apr_pollset_poll() must be for
port_associate(), and the call to _portfs() from call_port_getn() must be for
port_getn().  If the EAGAIN came from port_associate(), we wouldn't have called
port_getn().  Are calls [1] and [2] from the same call to apr_pollset_poll()?

BTW, APR < 1.3.9 didn't check the return code from port_associate().

The return code from _portfs() is some structure/union rval_t which I can't
find the definition of.  It does include multiple values; I don't know which of
multiple values dtrace is showing us.

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@apr.apache.org
For additional commands, e-mail: bugs-help@apr.apache.org


[Bug 48029] httpd-2.2.14 hangs in port_getn

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48029

--- Comment #18 from Jackie Rosen <ja...@hushmail.com> ---
*** Bug 260998 has been marked as a duplicate of this bug. ***
Seen from the domain http://volichat.com
Page where seen: http://volichat.com/adult-chat-rooms
Marked for reference. Resolved as fixed @bugzilla.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@apr.apache.org
For additional commands, e-mail: bugs-help@apr.apache.org


DO NOT REPLY [Bug 48029] httpd-2.2.14 hangs in port_getn

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48029

--- Comment #14 from Jan Schmidt <ht...@jan-o-sch.net> 2009-10-26 09:44:37 UTC ---
First off: Compiling with ac_cv_func_port_create=no makes everything run fine,
it seems. That is a strong hint that the problem is located somewhere in the
near the port interface, either implementation or usage.

However, I cannot reproduce your useport example on any of the machines I use
to trigger the buggy behavior here. Can you prove it returns correct values
when trussed? If no, that should be a completely different issue - if yes, it
might be related, though.

Is the port thing restricted to solaris? Is it worth investigating more? Is it
an option to globally disable ports for everyone?

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@apr.apache.org
For additional commands, e-mail: bugs-help@apr.apache.org


DO NOT REPLY [Bug 48029] httpd-2.2.14 hangs in port_getn

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48029

--- Comment #10 from Jeff Trawick <tr...@apache.org> 2009-10-22 07:33:02 UTC ---
Something we need to know is whether there are any descriptors associated with
the event port when we block in port_getn(), but I don't know how to check for
that other than observing the sequence of calls to port_getn() and
port_associate() leading up to the hang.

Do you see clear evidence of where the EAGAIN is coming from in the traces you
already have?

I think it is easy to be 100% sure where the EAGAIN is coming from.

Since cgi_bucket_read logs the retcode from port_associate() or port_getn(),
let's change EAGAIN from either of those to something unique so that we know
where the EAGAIN is coming from.

--- poll/unix/port.c.orig    2009-10-22 10:15:14.172437555 -0400
+++ poll/unix/port.c    2009-10-22 10:24:37.096601717 -0400
@@ -350,6 +350,9 @@
                              fd, get_event(ep->pfd.reqevents), ep);
         if (ret < 0) {
             rv = apr_get_netos_error();
+
+            if (rv == EAGAIN) rv = ENOEXEC;
+
             APR_RING_INSERT_TAIL(&(pollset->free_ring), ep, pfd_elem_t, link);
             break;
         }
@@ -367,6 +370,8 @@
     rv = call_port_getn(pollset->port_fd, pollset->port_set, pollset->nalloc,
                         &nget, timeout);

+    if (rv == EAGAIN) rv = ERANGE;
+
     /* decrease the waiting ASAP to reduce the window for calling 
        port_associate within apr_pollset_add() */
     apr_atomic_dec32(&pollset->waiting);

So with this patch, what is the error code/error string logged with "poll
failed
waiting for CGI child"?

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@apr.apache.org
For additional commands, e-mail: bugs-help@apr.apache.org


DO NOT REPLY [Bug 48029] httpd-2.2.14 hangs in port_getn

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48029

--- Comment #11 from Jan Schmidt <ht...@jan-o-sch.net> 2009-10-23 03:23:25 UTC ---
Nice move, but bad luck. Retrun value remains EAGAIN (11) according to the log
file. Still, I wonder why everything works fine under truss.

I'll try to grab more information with dtrace. Or I might introduce some quick
and diry logging in port.c to see clearer. Whichever comes first...

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@apr.apache.org
For additional commands, e-mail: bugs-help@apr.apache.org


DO NOT REPLY [Bug 48029] httpd-2.2.14 hangs in port_getn

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48029

--- Comment #7 from Jan Schmidt <ht...@jan-o-sch.net> 2009-10-22 05:32:43 UTC ---
*cough* I confused myself with the apr-1.x-PR47645.patch lying around.

Yes, the patch does apply now. Hoewever, from your previous explanations, I was
expecting to see some patch against port_associate() (due to EGAIN), while this
patch only changes port_getn() treatment.

Anyway, I'm going to try that now and come back then.

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@apr.apache.org
For additional commands, e-mail: bugs-help@apr.apache.org


DO NOT REPLY [Bug 48029] httpd-2.2.14 hangs in port_getn

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48029

Jeff Trawick <tr...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |CLOSED

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@apr.apache.org
For additional commands, e-mail: bugs-help@apr.apache.org


DO NOT REPLY [Bug 48029] httpd-2.2.14 hangs in port_getn

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48029

--- Comment #4 from Jeff Trawick <tr...@apache.org> 2009-10-21 13:06:01 UTC ---
Created an attachment (id=24404)
 --> (https://issues.apache.org/bugzilla/attachment.cgi?id=24404)
first test fix for PR 48029

Please try this patch.

The theory is that we didn't notice an event when it was first reported, so we
didn't know we had to reassociate the descriptor, so we never woke up when
activity occurred on that descriptor in the future.

This is what happened before 1.3.9 on some cases where -1/ETIME was returned by
port_getn().  The patch handles the same condition for -1/EAGAIN -- even an
event is returned, don't lose it.

This patch also includes the fix to PR48030, which is for a regression in this
code in APR 1.3.9.

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@apr.apache.org
For additional commands, e-mail: bugs-help@apr.apache.org


DO NOT REPLY [Bug 48029] httpd-2.2.14 hangs in port_getn

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48029

--- Comment #1 from Jeff Trawick <tr...@apache.org> 2009-10-21 11:30:34 UTC ---
This log message means port_getn() is returning EAGAIN:

(11)Resource temporarily unavailable: poll failed waiting for CGI child

EAGAIN isn't a documented return code for port_getn(), and APR doesn't handle
it.  (I see that libevent had to start handling it at one point.)

Since EAGAIN wasn't expected by APR, it gets passed back as-is to
cgi_bucket_read() as some arbitrary system error code.

Since EAGAIN wasn't expected by cgi_bucket_read(), it gets passed back as-is up
the filter chain as some arbitrary system error code.  Up the filter chain in
ap_content_length_filter(), EAGAIN is checked for (though it is supposed to
mean something else).  That causes the data read already to be flushed, then a
blocking bucket read is performed.

The read() which returns 0, where you ask "why 0? where has the rest of my cgi
output gone?" is a read on your script's stderr.  The apr_pollset_remove is
supposed to be for the stderr handle, since we've already hit EOF on it.

Does the final port_getn() hang for the amount specified by your Timeout
directive?  (300 is the default)

I'm wildly guessing for the moment that there's some bug with the pollset
manipulation such that the final port_getn() isn't looking at the right handle,
and the cgi process remains as a zombie until the pipe it shares with the httpd
child is emptied.  Maybe we'd avoid the mishandling except for the bogus
EAGAIN.

Can you get a truss showing the httpd child process and the cgi script?  (truss
-fp pid-of-httpd-blocked-in-poll before sending the request)

I'll try to model it more closely here.

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@apr.apache.org
For additional commands, e-mail: bugs-help@apr.apache.org


DO NOT REPLY [Bug 48029] httpd-2.2.14 hangs in port_getn

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48029

--- Comment #2 from Jan Schmidt <ht...@jan-o-sch.net> 2009-10-21 11:52:00 UTC ---
First: Yes, the amount of time passing until the request "completes" roughly
fits the config's Timeout value.

Unfortunately getting a truss of what is going on is not possible. I forget to
mention why we used dtrace to see what is going on - and that bit might be
interesting, too: When running with truss, everything works fine. Reproducibly.
Might be a hint to some kind of weird timing issue.

I was expecting the return value showed by dtrace for "libc.so.1`_portfs+0x15"
in [2] (which is 1), was the return value of port_getn.

When trussing the already hanging child process in [8], I see it sleeping in
port_getn. dtrace states it is in _portfs. pstack by the way also states
_portfs. Thus I guess port_getn is the high level name for _portfs.

If that is correct, then your explanation with 11 and EAGAIN does not add up,
as dtrace shows port_getn really returned 1. However, 1 would mean EPERM, which
is no documented return code for port_getn, either.

To gather more information, we could play around with dtrace somewhat further,
if we knew what we are looking for.

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@apr.apache.org
For additional commands, e-mail: bugs-help@apr.apache.org


DO NOT REPLY [Bug 48029] httpd-2.2.14 hangs in port_getn

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48029

--- Comment #12 from Jeff Trawick <tr...@apache.org> 2009-10-23 14:43:33 UTC ---
See recent updates added to 48030.

I'm thinking (bad idea?) that the root cause of Bernd's problem (in 48030) is
the same as this one, and Nick's problem (original report in 48029) is unique.

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@apr.apache.org
For additional commands, e-mail: bugs-help@apr.apache.org


DO NOT REPLY [Bug 48029] httpd-2.2.14 hangs in port_getn

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48029

--- Comment #16 from Jan Schmidt <ht...@jan-o-sch.net> 2009-10-27 11:52:14 UTC ---
> A little background:

Thanks for that part!

> Manwhile, I found a better solution besides disabling Event Ports which I've
> committed.  This is it:
> 
> http://svn.apache.org/viewvc/apr/apr/branches/1.3.x/poll/unix/port.c?r1=807269&r2=829803

Works! Seems we're on track now. However...

> >However, I cannot reproduce your useport example on any of the machines I use
> >to trigger the buggy behavior here. Can you prove it returns correct values
> >when trussed? If no, that should be a completely different issue - if yes, it
> >might be related, though.
> 
> You get "rc 0 nget 0" displayed on a machine with the problem?  Ouch.

... yes. That is still the case. But as the fix you committed works, there must
be some strange condition sometimes making port_getn return those negative
values aside from -1.

Had no time to play around with dtrace, yet. But it seems quite obvious it
would show port_getn returning a strange value as does your example (though not
reproducible here).

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@apr.apache.org
For additional commands, e-mail: bugs-help@apr.apache.org


DO NOT REPLY [Bug 48029] httpd-2.2.14 hangs in port_getn

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48029

--- Comment #8 from Jeff Trawick <tr...@apache.org> 2009-10-22 06:12:02 UTC ---
>I was expecting to see some patch against port_associate() (due to EGAIN), 
>while this patch only changes port_getn() treatment.

The source of the EAGAIN issue isn't perfectly clear to me, but

* APR 1.3.9 won't call port_getn() if port_associate() returns any error,
including EAGAIN, within the same call to apr_pollset_poll(); that's why I
don't think port_associate() is the source of the EAGAIN

* if I'm wrong about the source of EAGAIN: the documented reason for getting
EAGAIN from port_associate() is reaching a limit of objects associated with the
port
** we can't recover from that error once it occurs (if indeed the reason for
EAGAIN is the documented reason)
** the bug reported in PR 48030, if it doesn't cause a crash, could cause
objects to be reassociated multiple times, potentially leading to the
documented scenario; the patch includes that fix

Needless to say, I wish I could reproduce the issue ;)

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@apr.apache.org
For additional commands, e-mail: bugs-help@apr.apache.org


DO NOT REPLY [Bug 48029] httpd-2.2.14 hangs in port_getn

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48029

Jeff Trawick <tr...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

--- Comment #17 from Jeff Trawick <tr...@apache.org> 2009-11-13 11:11:24 UTC ---
fix has been backported to APR 1.3.x branch for future 1.3.10

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@apr.apache.org
For additional commands, e-mail: bugs-help@apr.apache.org


DO NOT REPLY [Bug 48029] httpd-2.2.14 hangs in port_getn

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48029

--- Comment #15 from Jeff Trawick <tr...@apache.org> 2009-10-26 10:39:51 UTC ---
>First off: Compiling with ac_cv_func_port_create=no makes everything run fine,
>it seems. That is a strong hint that the problem is located somewhere in the
>near the port interface, either implementation or usage.

>Is the port thing restricted to solaris? Is it worth investigating more? Is it
>an option to globally disable ports for everyone?

A little background:

apr_pollset_poll() is what APR provides to apps like httpd, and that works
cross-platform.  The "Event Port" stuff applies only to Solaris.  If Event
Ports were disabled, it wouldn't affect other platforms and it wouldn't affect
all applications on Solaris since plain poll() would be used.

Manwhile, I found a better solution besides disabling Event Ports which I've
committed.  This is it:

http://svn.apache.org/viewvc/apr/apr/branches/1.3.x/poll/unix/port.c?r1=807269&r2=829803

>However, I cannot reproduce your useport example on any of the machines I use
>to trigger the buggy behavior here. Can you prove it returns correct values
>when trussed? If no, that should be a completely different issue - if yes, it
>might be related, though.

You get "rc 0 nget 0" displayed on a machine with the problem?  Ouch.

Yes, truss or dbx makes that simple testcase run clean for me on Solaris
10/x86-32.

$ ./useport-32
rc -17349963 nget 0
$ truss ./useport-32 2>&1 | tail -5
brk(0x08062908)                                 = 0
fstat64(1, 0x08046FA0)                          = 0
rc 0 nget 0
write(1, " r c   0   n g e t   0\n", 12)        = 12
_exit(0)
$

similarities with your observations

a. got worse with APR 1.3.9

The huge negative retcode from port_getn() wouldn't cause a problem with APR
1.3.8, since 1.3.8 checked specifically for "rc == -1" instead of "rc < 0".

(The "got worse" idea assumes you had a different problem with 1.3.8, possibly
the one that 1.3.9 corrected, which was very intermittent.)

b. problem doesn't occur under observation via truss

c. unexpected EAGAIN failure

apr_pollset_poll() would grab whatever was in errno when it thought port_getn()
failed on these cases where port_getn() didn't really fail and didn't set
errno.  Since we're doing I/O with the CGI and recently did I/O with the client
, EAGAIN is a likely errno value to pick up incorrectly.

d. a case where port_getn() returns bogus negative number matches
cgi_bucket_read()'s usage

a case with the bad retcode is when port_getn() is called with 0 timeout to
find out immediately if an event is ready (if an event is actually ready it
won't return a bad retcode)

that's the kind of call made when cgi_bucket_read() is called the first time by
content-length filter; only when cgi_bucket_read() returns EAGAIN does
content-length filter tell it to wait until data is available

--/--

I hope you're able to try the tiny patch pointed to above in in this comment in
place of earlier attempts.

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@apr.apache.org
For additional commands, e-mail: bugs-help@apr.apache.org


DO NOT REPLY [Bug 48029] httpd-2.2.14 hangs in port_getn

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48029

--- Comment #5 from Jan Schmidt <ht...@jan-o-sch.net> 2009-10-22 03:03:07 UTC ---
The patch you suggest does not apply to apr 1.3.9. I cannot figure out at first
sight what to change in 1.3.9 to resemble the changes you made (to the HEAD
version, I guess). Do you want me to try the repository version of apr? Or can
you provide a port to 1.3.9, please?

I do not see anything from the patch to bug 48030 included (which indeed does
apply to 1.3.9).

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@apr.apache.org
For additional commands, e-mail: bugs-help@apr.apache.org