Posted to dev@httpd.apache.org by Manoj Kasichainula <ma...@io.com> on 1999/10/27 08:36:31 UTC

Various 2.0 bugs I'm looking at

If someone wants to fix them before I wake up, I'll be very happy.
I'm using the dexter MPM and ab to test on Linux, but I don't think
these are MPM-related.

I'm occasionally getting errors like:

[Wed Oct 27 02:21:36 1999] [warn] (9)Bad file descriptor: setsockopt: (TCP_NODELAY)
[Wed Oct 27 02:25:15 1999] [crit] [client 10.0.0.1] (9)Bad file descriptor: default_handler: mmap failed: /home/manoj/a2/htdocs/index.html

The first error seems to come from ap_get_os_sock() not getting a
valid socket (see process_socket in dexter.c), and the second seems to
come from ap_get_os_file() not getting a valid fd (http_core.c). I don't
know if the two are related. I see the first error rarely (1-5 out of
10000 connections) but still more often than the second error.

I'm guessing that this next one is from me inserting a bug during my
buff work, but I haven't found the bug yet:

[Wed Oct 27 02:25:15 1999] [error] [client 10.0.0.1] Invalid method in request <html>

-- 
Manoj Kasichainula - manojk at io dot com - http://www.io.com/~manojk/

Re: Various 2.0 bugs I'm looking at

Posted by Manoj Kasichainula <ma...@io.com>.

On Mon, Nov 08, 1999 at 07:21:58AM -0500, Ryan Bloom wrote:
> Why is the Buff code ever trying to close sockets without using
> ap_close_socket?  The native types (e.g. ap_os_file_t) are only in Apache
> for a short time.  They AREN'T and SHOULDN'T be staying in the Apache
> code.  In fact, I wish somebody would try to remove the rest of them ASAP.
> It is on my list of things I would like to do when I have time.

I've been more concerned with getting the server stable than APRizing
it. It's much easier to find bugs resulting from a change when there
aren't hordes of other bugs lurking around. Now, at least on Unix, the
server seems to be running quite well, so I think we can start
converting more code to the APR World Order.

> They
> were meant to be used so that Apache could talk to NON-APACHE modules.

High-performance MPMs will have to use native code. I don't see a way
for APR to portably export an interface that can optimally use NT's
asynchronous I/O APIs. The Linux SIGINFO stuff is probably easier, but
still difficult. And whatever problems there are in Apache right now
from the combination of APR and native code, I imagine they will show
up in external modules as well.

So, I think it's important that APR play well with native code.

> If that is the way they are going to be used, we really should
> reconsider APR at all, because we will forever be finding these
> kinds of bugs.

I think these sorts of bugs can be minimized if the portability code
exports as thin a layer as possible, meaning that it keeps as little
state as is possible on the various platforms, and if it gives the
programmer ultimate control over reasonable behaviors. For example, a
function to tell APR to relinquish ownership of a socket once the app
has called ap_get_os_sock() would have solved the double-close
problem just as well as the eventual APRization of the core code will.
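
The "relinquish ownership" idea above can be sketched roughly as follows.
This is only an illustration: demo_sock_t, demo_sock_release() and
demo_sock_close() are hypothetical names made up for this sketch, not APR
types or functions, and a pipe stands in for an accepted socket.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical wrapper type for illustration; NOT an APR type. */
typedef struct {
    int fd;     /* native descriptor */
    int owned;  /* nonzero while the wrapper is responsible for close() */
} demo_sock_t;

/* Hand the native descriptor to the caller and give up responsibility
 * for closing it. */
int demo_sock_release(demo_sock_t *s)
{
    s->owned = 0;
    return s->fd;
}

/* Close the descriptor only if the wrapper still owns it. */
int demo_sock_close(demo_sock_t *s)
{
    int rv = 0;
    if (s->owned) {
        rv = close(s->fd);
        s->owned = 0;
    }
    s->fd = -1;
    return rv;
}

/* Returns 1 if the descriptor is still valid after release + wrapper
 * close, 0 if it is not, -1 if the pipe could not be created. */
int demo_release_then_close(void)
{
    int p[2];
    if (pipe(p) != 0)
        return -1;

    demo_sock_t s = { p[0], 1 };
    int raw = demo_sock_release(&s);  /* native code now owns raw */
    demo_sock_close(&s);              /* must not touch raw */

    int alive = fcntl(raw, F_GETFD) != -1;
    close(raw);                       /* the single real close */
    close(p[1]);
    return alive;
}
```

With a flag like this, exactly one side ends up calling close(), no matter
which layer the descriptor is handed to.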

-- 
Manoj Kasichainula - manojk at io dot com - http://www.io.com/~manojk/

Re: Various 2.0 bugs I'm looking at

Posted by Ryan Bloom <rb...@raleigh.ibm.com>.

I should have responded to this on Friday, but I was taking it easy and
just trying to get through 1500 messages.

Why is the Buff code ever trying to close sockets without using
ap_close_socket?  The native types (e.g. ap_os_file_t) are only in Apache
for a short time.  They AREN'T and SHOULDN'T be staying in the Apache
code.  In fact, I wish somebody would try to remove the rest of them ASAP.
It is on my list of things I would like to do when I have time.  They
were meant to be used so that Apache could talk to NON-APACHE modules.
That is what they were designed for.  NOT so that parts of Apache could
not have to use APR.  If that is the way they are going to be used, we
really should reconsider APR at all, because we will forever be finding
these kinds of bugs.

Ryan

> OK, the problem was that APR and buff were both trying to close the
> same socket. Under heavy load, buff would close the socket, the socket
> would get reassigned to something else (a file, leading to some of the
> errors, or a socket, leading to the rest), then APR would close the
> socket that didn't belong to it. That last step is the new piece that
> got triggered by the patch I referenced above. I've put in a hack for
> now (preventing IOL from actually closing the socket) on Unix.
> 
> The root problem is the switching back and forth between APR and OS
> sockets. APR thinks it owns the socket and is responsible for closing
> it, and so does buff. The best solution for MPMs that don't get
> APRized is probably not to use the ap_accept call, but to use the OS
> version instead. But, the bug also naturally goes away on platforms
> that have an APR-based iol_socket.
> 
> -- 
> Manoj Kasichainula - manojk at io dot com - http://www.io.com/~manojk/
> 

_______________________________________________________________________
Ryan Bloom		rbb@raleigh.ibm.com
4205 S Miami Blvd	
RTP, NC 27709		It's a beautiful sight to see good dancers 
			doing simple steps.  It's a painful sight to
			see beginners doing complicated patterns.

Re: Various 2.0 bugs I'm looking at

Posted by Dean Gaudet <dg...@arctic.org>.

note... iol was something i did because i needed layering but wasn't ready
for APR.  you may want to revisit that now -- perhaps APR should give the
layering.  (NSPR provides layering for example... i keep coming back to
NSPR 'cause they did a lot right.)

Dean

On Fri, 29 Oct 1999, Manoj Kasichainula wrote:

> On Thu, Oct 28, 1999 at 06:30:55PM -0400, Me at IBM wrote:
> > I think I've narrowed down the problem to Brian Havard's patch to add
> > a context field to ap_accept. But, I see nothing in that patch that
> > should break anything.
> 
> OK, the problem was that APR and buff were both trying to close the
> same socket. Under heavy load, buff would close the socket, the socket
> would get reassigned to something else (a file, leading to some of the
> errors, or a socket, leading to the rest), then APR would close the
> socket that didn't belong to it. That last step is the new piece that
> got triggered by the patch I referenced above. I've put in a hack for
> now (preventing IOL from actually closing the socket) on Unix.
> 
> The root problem is the switching back and forth between APR and OS
> sockets. APR thinks it owns the socket and is responsible for closing
> it, and so does buff. The best solution for MPMs that don't get
> APRized is probably not to use the ap_accept call, but to use the OS
> version instead. But, the bug also naturally goes away on platforms
> that have an APR-based iol_socket.
> 
> -- 
> Manoj Kasichainula - manojk at io dot com - http://www.io.com/~manojk/
>

Re: Various 2.0 bugs I'm looking at

Posted by Brian Havard <br...@kheldar.apana.org.au>.

On Fri, 29 Oct 1999 19:43:41 -0500, Manoj Kasichainula wrote:

>OK, the problem was that APR and buff were both trying to close the
>same socket. Under heavy load, buff would close the socket, the socket
>would get reassigned to something else (a file, leading to some of the
>errors, or a socket, leading to the rest), then APR would close the
>socket that didn't belong to it. That last step is the new piece that
>got triggered by the patch I referenced above. I've put in a hack for
>now (preventing IOL from actually closing the socket) on Unix.
>
>The root problem is the switching back and forth between APR and OS
>sockets. APR thinks it owns the socket and is responsible for closing
>it, and so does buff. The best solution for MPMs that don't get
>APRized is probably not to use the ap_accept call, but to use the OS
>version instead. But, the bug also naturally goes away on platforms
>that have an APR-based iol_socket.

Well, I just APRized the OS/2 iol_socket code and as far as I can tell the
same code should work on all platforms. I've hit it pretty hard (ab -c 50 -n
100000 ...) and it held up just fine.

-- 
 ______________________________________________________________________________
 |  Brian Havard                 |  "He is not the messiah!                   |
 |  brianh@kheldar.apana.org.au  |  He's a very naughty boy!" - Life of Brian |
 ------------------------------------------------------------------------------

Re: Various 2.0 bugs I'm looking at

Posted by Manoj Kasichainula <ma...@io.com>.

On Thu, Oct 28, 1999 at 06:30:55PM -0400, Me at IBM wrote:
> I think I've narrowed down the problem to Brian Havard's patch to add
> a context field to ap_accept. But, I see nothing in that patch that
> should break anything.

OK, the problem was that APR and buff were both trying to close the
same socket. Under heavy load, buff would close the socket, the socket
would get reassigned to something else (a file, leading to some of the
errors, or a socket, leading to the rest), then APR would close the
socket that didn't belong to it. That last step is the new piece that
got triggered by the patch I referenced above. I've put in a hack for
now (preventing IOL from actually closing the socket) on Unix.

The root problem is the switching back and forth between APR and OS
sockets. APR thinks it owns the socket and is responsible for closing
it, and so does buff. The best solution for MPMs that don't get
APRized is probably not to use the ap_accept call, but to use the OS
version instead. But, the bug also naturally goes away on platforms
that have an APR-based iol_socket.
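
The double-close sequence described above can be reproduced in miniature
with plain POSIX calls. This is a sketch under assumptions, not the actual
buff or APR code: the function name is invented for illustration, an
AF_UNIX socket stands in for the accepted connection, and /dev/null stands
in for the file that gets the recycled descriptor number. It relies on
POSIX handing out the lowest unused descriptor number, and on nothing else
in the process opening descriptors in between.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if the second (stale) close destroyed the unrelated file
 * descriptor, 0 if the number was not reused or the write did not fail
 * as expected, -1 if setup failed. */
int stale_close_hits_new_fd(void)
{
    int sock = socket(AF_UNIX, SOCK_STREAM, 0);
    if (sock < 0)
        return -1;

    close(sock);                 /* first owner (think: buff) closes */

    /* The freed number is handed to the next descriptor opened. */
    int file = open("/dev/null", O_WRONLY);
    if (file < 0)
        return -1;
    if (file != sock) {          /* number was not reused; nothing to show */
        close(file);
        return 0;
    }

    close(sock);                 /* second owner (think: APR) closes "its"
                                  * socket, which is really the file now */

    /* The file's owner now gets EBADF, "(9)Bad file descriptor". */
    if (write(file, "x", 1) == -1 && errno == EBADF)
        return 1;
    return 0;
}
```

In a loaded multithreaded server the reuse happens in another thread, which
is why the failures look random and show up only occasionally.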

-- 
Manoj Kasichainula - manojk at io dot com - http://www.io.com/~manojk/

Re: Various 2.0 bugs I'm looking at

Posted by Manoj Kasichainula <ma...@raleigh.ibm.com>.

On Wed, Oct 27, 1999 at 01:36:31AM -0500, Me at IO wrote:
> If someone wants to fix them before I wake up, I'll be very happy.
> I'm using the dexter MPM and ab to test on Linux, but I don't think
> these are MPM-related.

I think I've narrowed down the problem to Brian Havard's patch to add
a context field to ap_accept. But, I see nothing in that patch that
should break anything. (Can anyone?)

All I can think of is that the pools are subtly hosed, and that adding
this new context triggered a latent locking bug. The pool code is from
1.3 with mods, not the code from the apache-apr tree which seemed rock
solid. I may try turning the APR pools into the pools from the old
hybrid server and see if my bugs go away.

-- 
Manoj Kasichainula - manojk@raleigh.ibm.com
IBM, Apache Development