Posted to dev@httpd.apache.org by Cliff Skolnick <cl...@steam.com> on 1999/05/13 12:24:46 UTC

Re: Proposal: Get rid of most accept mutex() calls on hybrid server.

I asked some folks from Sun about this thread; the response is below.

>On Tue, 11 May 1999, Tony Finch wrote:
>
>> Dean Gaudet <dg...@arctic.org> wrote:
>> >On Mon, 10 May 1999, Tony Finch wrote:
>> >> Dean Gaudet <dg...@arctic.org> wrote:
>> >> >
>> >> >Actually, I suspect that we don't really want to interprocess lock at all
>> >> >in the multithreaded server.  We use non-blocking listening sockets, and
>> >> >pay the wake-all cost for the small number of processes (we're talking
>> >> >like 8 processes, right?) 
>> >> 
>> >> If there's a select collision isn't *every* process woken up (not
>> >> just the httpds)?
>> >
>> >I'm not sure what you mean...  if a kernel had only one global sleeping
>> >queue, yeah... but then there'd be no way for us to avoid thundering herd,
>> >since everything would be awakened at all times.  But kernels typically
>> >have a lot of sleeping queues... including one on every socket.
>> 
>> Sorry, I was too terse. The BSD network stack only has space for
>> keeping track of one pid select()ing on each socket. If more than one
>> child select()s on a listen socket at the same time the kernel cannot
>> keep track of them so it doesn't even try: it just marks a collision
>> and when a connection arrives on that socket it wakes up every process
>> in select().
>
>Hmm, that sucks.  Linux keeps a list of all pids on each socket... it
>doesn't cost much, the select() code allocates a page of memory to store
>the wait list elements in.  For stuff other than select, the wait list
>elements are allocated on the kernel stack.  So even though there's a
>dynamic cost to allocating list elements, it's fairly cheap. 
>
>I wonder what solaris does (pretty much the only other platform I care
>about ;) 
>
>Dean

-- forward of a message from a kind person at Sun --

There's no limit to the number of LWPs that can select() on a socket. It
should be noted that poll() is preferred, as select() in Solaris is implemented
using poll(): the select() args are converted to pollfd_t's on the
stack (a 1024-element array), poll() is called, then the results are
converted back to a select() mask.

The implementation of poll() changed in Solaris 7, as several apps (httpd,
database, ...) required the ability to poll() on many thousands of FDs.
Prior to Solaris 7 it was a typical linked list of waiters per file_t
(and didn't scale well :(.

As of Solaris 7, a scheme referred to as /dev/poll was implemented, such that
pollfd_t's are registered with the underlying FS (i.e. UFS, SOCKFS, ...)
and the FS does asynchronous notification. The end result is that poll()
now scales to tens of thousands of FDs per LWP. There is also a new API for
/dev/poll: you open /dev/poll and do write()s (to register a number
of pollfd's) and read()s (to wait for, or in the nonblocking case check
for, pollfd events). Using the /dev/poll API, memory is your only limit
for scalability.



Re: /dev/poll vs. aio_ (was: Re: Proposal: Get rid of most accept mutex() calls on hybrid server.)

Posted by Richard Gooch <rg...@atnf.csiro.au>.
Stephen C. Tweedie writes:
> Hi,
> 
> On Sat, 29 May 1999 07:51:25 +1000, Richard Gooch <rg...@atnf.csiro.au> said:
> 
> > Why not just increase the RT signal queue size? Add a command to
> > prctl(2) so the application can tune this.
> 
> Letting the application increase the queue size opens up a DOS attack:
> you can lock down arbitrarily much non-swappable memory with it.  root
> can already tune it via /proc/sys/kernel/rtsig-max.

OK, so there's a way for the administrator to tune this. I note the
default on my system is 1024, which should be fine for most
applications. However, for a busy WWW server, how deep should the
queue be? Anyone got a benchmark?

Assuming you need a much deeper queue (4096 for example), then tuning
rtsig-max will work, but at the cost of increasing the queue depth for
all processes, most of which won't need it. So how about creating
rtsig-limit and adding a command to prctl(2), so that a process which
needs to increase the queue depth can do so (up to the sysadmin tuned
limit), but other processes use the default?

				Regards,

					Richard....

Re: /dev/poll vs. aio_ (was: Re: Proposal: Get rid of most accept mutex() calls on hybrid server.)

Posted by "Stephen C. Tweedie" <sc...@redhat.com>.
Hi,

On Sat, 29 May 1999 07:51:25 +1000, Richard Gooch <rg...@atnf.csiro.au> said:

> Why not just increase the RT signal queue size? Add a command to
> prctl(2) so the application can tune this.

Letting the application increase the queue size opens up a DOS attack:
you can lock down arbitrarily much non-swappable memory with it.  root
can already tune it via /proc/sys/kernel/rtsig-max.

--Stephen

Re: /dev/poll vs. aio_ (was: Re: Proposal: Get rid of most accept mutex() calls on hybrid server.)

Posted by Richard Gooch <rg...@atnf.csiro.au>.
Stephen C. Tweedie writes:
> Hi,
> 
> On Fri, 14 May 1999 14:44:08 +0000, Dan Kegel <da...@alumni.caltech.edu>
> said:
> 
> > I have yet to use aio_ or F_SETSIG, but reading ready fd's from
> > /dev/poll makes more sense to me than listening for realtime signals
> > from aio_, which according to
> > http://www.deja.com/getdoc.xp?AN=366163395 can overflow, in which case
> > the kernel sends a SIGIO to say 'realtime signals overflowed, better
> > do a full poll'.  
> 
> Yes.
> 
> > I'm contemplating writing a server that uses aio_; that case kind of
> > defeats the purpose of using aio_, and handling it sounds annoying and
> > suboptimal.
> 
> It adds code complexity but it shouldn't hurt the normal case: you don't
> expect to get an overflow unless you have a _lot_ of traffic coming
> through, in which case the cost of an occasional poll() to clear the
> queue shouldn't make much odds.

Why not just increase the RT signal queue size? Add a command to
prctl(2) so the application can tune this.

				Regards,

					Richard....

Re: /dev/poll vs. aio_ (was: Re: Proposal: Get rid of most accept mutex() calls on hybrid server.)

Posted by Ulrich Drepper <dr...@cygnus.com>.
"Stephen C. Tweedie" <sc...@redhat.com> writes:

> Amen to that --- we _need_ CLONE_SIGNALS for this.

This is one of the things, yes.  But it would be good to have a few
more things.

-- 
---------------.      drepper at gnu.org  ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \    ,-------------------'   \  Sunnyvale, CA 94089 USA
Cygnus Solutions `--' drepper at cygnus.com   `------------------------

Re: /dev/poll vs. aio_ (was: Re: Proposal: Get rid of most accept mutex() calls on hybrid server.)

Posted by "Stephen C. Tweedie" <sc...@redhat.com>.
Hi,

On 28 May 1999 15:00:23 -0700, Ulrich Drepper <dr...@cygnus.com> said:

> "Stephen C. Tweedie" <sc...@redhat.com> writes:
>> However, it would be good to see real life profiling on this.

> Not without having a really good thread library first 

Amen to that --- we _need_ CLONE_SIGNALS for this.

> or at least optimizing the aio library.  

No, aio_* isn't used in this model: the reads and writes are
non-blocking already, it's just out-of-band activity indicators which we
need.  aio_* is only useful for IO which is otherwise blocking.

--Stephen

Re: /dev/poll vs. aio_ (was: Re: Proposal: Get rid of most accept mutex() calls on hybrid server.)

Posted by Ulrich Drepper <dr...@cygnus.com>.
"Stephen C. Tweedie" <sc...@redhat.com> writes:

> However, it would be good to see real life profiling on this.

Not without having a really good thread library first, or at least
optimizing the aio library.  The way I wrote it is *not* optimized for
performance, but rather for standards compliance.  I will soon write an
optimized version of the library, and until that happens it's kind of
pointless to compare the methods for the purpose of making long-term
decisions.

-- 
---------------.      drepper at gnu.org  ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \    ,-------------------'   \  Sunnyvale, CA 94089 USA
Cygnus Solutions `--' drepper at cygnus.com   `------------------------

Re: /dev/poll vs. aio_ (was: Re: Proposal: Get rid of most accept mutex() calls on hybrid server.)

Posted by "Stephen C. Tweedie" <sc...@redhat.com>.
Hi,

On Fri, 14 May 1999 14:44:08 +0000, Dan Kegel <da...@alumni.caltech.edu>
said:

> I have yet to use aio_ or F_SETSIG, but reading ready fd's from
> /dev/poll makes more sense to me than listening for realtime signals
> from aio_, which according to
> http://www.deja.com/getdoc.xp?AN=366163395 can overflow, in which case
> the kernel sends a SIGIO to say 'realtime signals overflowed, better
> do a full poll'.  

Yes.

> I'm contemplating writing a server that uses aio_; that case kind of
> defeats the purpose of using aio_, and handling it sounds annoying and
> suboptimal.

It adds code complexity but it shouldn't hurt the normal case: you don't
expect to get an overflow unless you have a _lot_ of traffic coming
through, in which case the cost of an occasional poll() to clear the
queue shouldn't make much odds.

However, it would be good to see real life profiling on this.

--Stephen

/dev/poll vs. aio_ (was: Re: Proposal: Get rid of most accept mutex() calls on hybrid server.)

Posted by Dan Kegel <da...@alumni.caltech.edu>.
Dean Gaudet wrote:
> (A person at Sun wrote:)
> > As of Solaris 7, a scheme referred to as /dev/poll was implemented, such that
> > pollfd_t's are registered with the underlying FS (i.e. UFS, SOCKFS, ...)
> > and the FS does asynchronous notification. The end result is that poll()
> > now scales to tens of thousands of FDs per LWP. There is also a new API for
> > /dev/poll: you open /dev/poll and do write()s (to register a number
> > of pollfd's) and read()s (to wait for, or in the nonblocking case check
> > for, pollfd events). Using the /dev/poll API, memory is your only limit
> > for scalability.
>
> Now that's real nice.  I've been advocating this on linux kernel for a
> long time.  Say hello to completion ports the unix way.  I'm assuming they
> do the "right thing" and wake up in LIFO order, and allow you to read
> multiple events at once.

I have yet to use aio_ or F_SETSIG, but reading ready fd's from
/dev/poll makes more sense to me than listening for realtime signals
from aio_, which according to http://www.deja.com/getdoc.xp?AN=366163395
can overflow, in which case the kernel sends a SIGIO to say 'realtime
signals overflowed, better do a full poll'.  I'm contemplating writing
a server that uses aio_; that case kind of defeats the purpose of
using aio_, and handling it sounds annoying and suboptimal.

/dev/poll would never overflow in that way.

- Dan

Re: Proposal: Get rid of most accept mutex() calls on hybrid server.

Posted by Dean Gaudet <dg...@arctic.org>.
On Thu, 13 May 1999, Cliff Skolnick wrote:

> There's no limit to the number of LWPs that can select() on a socket. It
> should be noted that poll() is preferred, as select() in Solaris is implemented
> using poll(): the select() args are converted to pollfd_t's on the
> stack (a 1024-element array), poll() is called, then the results are
> converted back to a select() mask.

Yeah whenever we say "select" we really mean "select or poll as
appropriate" (at least that's what I mean -- because linux 2.2.x also
implements poll as the core primitive and converts select to poll... and
because poll really is the only logical alternative for thousands of fds). 

> The implementation of poll() changed in Solaris 7, as several apps (httpd,
> database, ...) required the ability to poll() on many thousands of FDs.
> Prior to Solaris 7 it was a typical linked list of waiters per file_t
> (and didn't scale well :(.

Interesting... if there are n waiters waiting on f_1, f_2, ... f_n fds
respectively, it requires O(f_1 + f_2 + ... + f_n) time to tell the kernel
about the fds you're interested in... I'm not seeing a way to shorten that
which would make it worthwhile to change the linked list. 

> As of Solaris 7, a scheme referred to as /dev/poll was implemented, such that
> pollfd_t's are registered with the underlying FS (i.e. UFS, SOCKFS, ...)
> and the FS does asynchronous notification. The end result is that poll()
> now scales to tens of thousands of FDs per LWP. There is also a new API for
> /dev/poll: you open /dev/poll and do write()s (to register a number
> of pollfd's) and read()s (to wait for, or in the nonblocking case check
> for, pollfd events). Using the /dev/poll API, memory is your only limit
> for scalability.

Now that's real nice.  I've been advocating this on linux kernel for a
long time.  Say hello to completion ports the unix way.  I'm assuming they
do the "right thing" and wake up in LIFO order, and allow you to read
multiple events at once.

In case it wasn't obvious -- I planned the event thread to be customized
to each platform... it's where we get to take advantage of
platform-specific extensions (and quirks).

So far it sounds like only *BSD has problems with lots of processes
blocked in select() on the same socket.  We could use flock() locking
(with LOCK_NB) to arbitrate which event thread is handling the listening
sockets; it won't be any worse than what we have now, and will be lots
better in other ways. 

I really do think the architecture has all the right trade-offs: 

- We get to use threads for the protocol handling code, which is real nice
  because we don't have to build complex state engines to handle sockets
  blocking.  The common case is that all the protocol stuff is in the
  first packet.

- Module writers get to use threads, which gives them the same
  straightforward programming model.

- The people who whine about "why does apache not use select?" are
  appeased because on static-only servers we do use select for pretty
  much everything.

- We're set to handle hundreds upon hundreds of long haul slow clients
  downloading large responses.

Dean