Posted to dev@httpd.apache.org by Brian Pane <br...@cnet.com> on 2002/11/24 03:40:58 UTC

request for comments: multiple-connections-per-thread MPM design

Here's an outline of my latest thinking on how to build a
multiple-connections-per-thread MPM for Apache 2.2.  I'm
eager to hear feedback from others who have been researching
this topic.

Thanks,
Brian


Overview
--------
The design described here is a hybrid sync/async architecture:

* Do the slow part of request processing--network reads and
  writes--in an event loop for scalability.

* Do the fast part of request processing--everything other
  than network I/O--in a one-request-per-thread mode so that
  module developers don't have to rewrite all their code as
  reentrant state machines.


Basic structure
---------------

Each httpd child process has four thread pools:

1. Listener thread
      A Listener thread accept(2)s a connection, creates
      a conn_rec for it, and sends it to the Reader thread.

2. Reader thread
      A Reader thread runs a poll loop to watch for incoming
      data on all connections that have been passed to it by a
      Listener or Writer.  It reads the next request from each
      connection, builds a request_rec, and passes the conn_rec
      and the request_rec on to the Request Processor thread
      pool.

3. Request Processor threads
      Each Request Processor thread handles one request_rec
      at a time.  When it receives a request from the Reader
      thread, the Request Processor runs all the request
      processing hooks (auth, map to storage, handler, etc)
      except the logger, plus the output filter stack except
      the core_output_filter.  As the Request Processor produces
      output brigades, it sends them to the Writer thread pool.
      Once the Request Processor has finished handling the
      request, it sends the last of the output data, plus
      the request_rec, to the Writer.

4. Writer thread
      The Writer thread runs a poll loop to output the data
      for all connections that have been passed to it.  When
      it finishes writing the response for a request, the
      Writer calls the logger, destroys the request_rec,
      and either executes the lingering_close on the connection
      or sends the connection back to the Reader, depending on
      whether the connection is a keep-alive.


Component details
-----------------

* Listener thread: This thread will need to use an accept_mutex
  to serialize the accept, just like 2.0 does.

* Passing connections from Listener to Reader:  When the
  Listener creates a new connection, it adds it to a global
  queue and writes one byte to a pipe.  The other end of the
  pipe is in the Reader's pollset.  When the poll(2) in the
  Reader completes, the Reader detects the data available on
  the pipe, reads and discards the byte, and retrieves all
  the new connections in the queue.
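
  In rough code (the names here are only for illustration, not
  actual httpd APIs; the pipe would be non-blocking in practice):

    #include <stdlib.h>
    #include <unistd.h>
    #include <pthread.h>

    typedef struct conn_node {
        void *conn;                    /* the new conn_rec */
        struct conn_node *next;
    } conn_node;

    static conn_node *new_conns;       /* shared Listener->Reader queue */
    static pthread_mutex_t new_conns_lock = PTHREAD_MUTEX_INITIALIZER;
    static int wakeup_pipe[2];         /* [0] sits in the Reader's pollset */

    /* Listener: enqueue the connection, then write one byte so the
     * Reader's poll(2) returns even if all its sockets are idle. */
    static void hand_off_to_reader(void *conn)
    {
        conn_node *node = malloc(sizeof(*node));
        node->conn = conn;
        pthread_mutex_lock(&new_conns_lock);
        node->next = new_conns;
        new_conns = node;
        pthread_mutex_unlock(&new_conns_lock);
        write(wakeup_pipe[1], "x", 1);
    }

    /* Reader, when poll() reports the pipe readable: discard the
     * byte(s), then take the whole queue in one locked operation. */
    static conn_node *collect_new_connections(void)
    {
        char buf[64];
        conn_node *batch;
        read(wakeup_pipe[0], buf, sizeof(buf));
        pthread_mutex_lock(&new_conns_lock);
        batch = new_conns;
        new_conns = NULL;
        pthread_mutex_unlock(&new_conns_lock);
        return batch;                  /* Reader adds each to its pollset */
    }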

* Passing connections from Reader to Request Processor:  When
  the Reader has consumed all the data in a connection, it
  adds the connection and the newly created request_rec to
  a global queue and signals a condition variable.  The
  idle Request Processor threads take turns waiting on the
  condition variable (leader/followers model).
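
  Roughly (again with made-up names; the work item would carry the
  conn_rec and request_rec pointers):

    #include <pthread.h>

    typedef struct work_item {
        void *conn;
        void *request;
        struct work_item *next;
    } work_item;

    static work_item *work_head, *work_tail;
    static pthread_mutex_t work_lock  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  work_ready = PTHREAD_COND_INITIALIZER;

    /* Reader: enqueue and wake exactly one idle Request Processor. */
    static void dispatch_request(work_item *item)
    {
        item->next = NULL;
        pthread_mutex_lock(&work_lock);
        if (work_tail)
            work_tail->next = item;
        else
            work_head = item;
        work_tail = item;
        pthread_cond_signal(&work_ready);  /* one follower becomes leader */
        pthread_mutex_unlock(&work_lock);
    }

    /* Request Processor: block until a request is available. */
    static work_item *next_request(void)
    {
        work_item *item;
        pthread_mutex_lock(&work_lock);
        while (work_head == NULL)
            pthread_cond_wait(&work_ready, &work_lock);
        item = work_head;
        work_head = item->next;
        if (work_head == NULL)
            work_tail = NULL;
        pthread_mutex_unlock(&work_lock);
        return item;
    }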

* Passing output brigades from Request Processor to Writer:
  Same model as the Listener-to-Reader handoff: add to a
  queue, and write a byte to a pipe.

* Bucket management:  Implicit in this design is the idea that
  the Writer thread can be writing part of an HTTP response
  while a Request Processor thread is still generating more
  buckets for that request.  This is a good thing because it
  means that the Request Processor thread won't ever find itself
  blocked on a network write, so it can produce all its output
  quickly and move on to another request (which is the key to
  keeping the number of threads low).  However, it does mean
  that we need a thread-safe solution for allocating and
  destroying buckets and brigades.

* request_rec lifetime:  When a Request Processor thread has
  produced all of the output for a response, it adds a metadata
  bucket to the last output brigade.  This bucket points to the
  request_rec.  Upon sending the last of the request's output,
  the Writer thread is responsible for calling the logger and
  destroying the request and its pool.  This would be a major
  change from how 1.x and 2.0 work.  The rationale for it is
  twofold:
    - Eliminate the need to set aside buckets from the request
      pool into the connection pool in the core_output_filter,
      which has been a source of many bugs in 2.0.
    - Allow for more accurate logging of bytes_sent (e.g., in
      mod_logio) by delaying the logger until the request has
      actually been sent.
  One implication of this change is that the request pool could
  no longer be a sub-pool of the connection pool, unless we make
  subpool creation a thread-safe operation.
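
  For illustration, the Writer's side of this could look like the
  sketch below, assuming a new metadata bucket type (call it
  "request_end"; nothing like it exists in 2.0 today) whose data
  pointer is the request_rec:

    #include "httpd.h"
    #include "http_protocol.h"
    #include "apr_buckets.h"

    /* hypothetical bucket type, added by the Request Processor */
    extern const apr_bucket_type_t bucket_type_request_end;

    static void writer_check_for_request_end(apr_bucket_brigade *bb)
    {
        apr_bucket *b;
        for (b = APR_BRIGADE_FIRST(bb);
             b != APR_BRIGADE_SENTINEL(bb);
             b = APR_BUCKET_NEXT(b)) {
            if (b->type == &bucket_type_request_end) {
                request_rec *r = b->data;
                ap_run_log_transaction(r);  /* loggers see the real
                                             * bytes_sent */
                apr_pool_destroy(r->pool);  /* request and its pool die
                                             * here, in the Writer thread */
                break;
            }
        }
    }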


Open questions
--------------
* Limiting the Reader and Writer pools to one thread each will
  simplify the design and implementation.  But will this impair
  our ability to take advantage of lots of CPUs?

* Can we eliminate the listener thread?  It would be faster to just
  have the Reader thread include the listen socket(s) in its pollset.
  But if we did that, we'd need some new way to synchronize the
  accept handling among multiple child processes, because we can't
  have the Reader thread blocking on an accept mutex when it has
  existing connections to watch.

* Is there a more efficient way to interrupt a thread that's
  blocked in a poll call?  That's a crucial step in the Listener-to-
  Reader and Request Processor-to-Writer handoffs.  Writing a byte
  to a pipe requires two extra syscalls (a read and a write) per
  handoff.  Sending a signal to the target thread is the only
  other solution I can think of at the moment, but that's bad
  because the target thread might be in the middle of a read
  or write call, rather than a poll, at the moment when we hit
  it with a signal, so the read or write will fail with EINTR.

  Maybe the best solution would be a hybrid: using atomic
  operations, have the Reader maintain a flag that indicates
  whether it's blocked on a poll call or not.  If the Listener
  sees that the Reader is blocked in a poll, it sends a signal
  to the Reader to interrupt the poll; otherwise, it just
  adds the new connection to the queue and expects the Reader
  to check the queue again before its next poll call.
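
  In rough code (the flag would really be maintained with APR atomic
  operations; a plain sig_atomic_t keeps the sketch short, and the
  small race between setting the flag and actually entering poll()
  is glossed over):

    #include <pthread.h>
    #include <signal.h>
    #include <string.h>
    #include <poll.h>

    static volatile sig_atomic_t reader_in_poll;
    static pthread_t reader_thread;

    static void poll_interrupted(int sig)
    {
        (void)sig;  /* no-op: its only job is to make poll() fail EINTR */
    }

    static void reader_loop(struct pollfd *fds, int nfds)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = poll_interrupted;   /* no SA_RESTART set */
        sigaction(SIGUSR1, &sa, NULL);

        for (;;) {
            /* ...drain the new-connection queue into fds[] here... */
            reader_in_poll = 1;
            poll(fds, nfds, -1);            /* returns -1/EINTR if hit */
            reader_in_poll = 0;
            /* ...handle whichever descriptors are ready... */
        }
    }

    /* Listener, right after adding a new connection to the queue */
    static void wake_reader(void)
    {
        if (reader_in_poll)
            pthread_kill(reader_thread, SIGUSR1);
        /* else the Reader re-checks the queue before its next poll() */
    }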

* Do any major modules have a need to do blocking I/O or
  expensive computation within their input handlers?  That
  would cause problems for the single Reader thread, which
  depends on input handlers running quickly so it can get
  back to its poll loop.



Re: Another async I/O proposal [was Re: request for comments: multiple-connections-per-thread MPM design]

Posted by Manoj Kasichainula <ma...@io.com>.
On Mon, Nov 25, 2002 at 08:10:12AM -0800, Brian Pane wrote:
> On Mon, 2002-11-25 at 00:02, Manoj Kasichainula wrote:
> > while (event = get_next_event())
> >    add more spare threads if needed
> >    event_processor = lookup_event_processor(event)
> >    ticket = event_processor(event)
> >    if (ticket) submit_ticket(ticket)
> >    exit loop (and thus end thread) if not needed
> > 
> > The event_processor can take as long as it wants, since there are other
> > threads who can wait for the next event.
> 
> Where is the locking done?  Is the lock just around the
> get_next_event() call?

Yeah, I imagined the locking would be implicit in there. Different event
mechanisms on various OSes could require different locking schemes, so
if locking is needed, it should be hidden there.

> Once the httpd_request_processor() has created a new ticket for
> the write, how does the submit_ticket() get the socket added into
> the pollset?  If it's possible for another request to be in the
> middle of a poll call at the same time, does submit_ticket()
> interrupt the poll in order to add the new descriptor?

This is a problem I missed somehow. I mentioned it in the other branch
of the thread.

> - Flow control will be difficult.  Here's a tricky scenario I
>   just thought of:  The server is configured to run 10 threads.
>   Most of the time, it only needs a couple of them, because it's
>   serving mostly static content and an occasional PHP request.
>   Suddenly, it gets a flood of requests for PHP pages.  The first
>   ten of these quickly take over the ten available threads.  As
>   PHP doesn't know how to work in an event-driven world, each
>   of these requests holds onto its thread for a long time.  When
>   one of them finally completes, it produces some content to be
>   written.  But the writes may be starved, because the first
>   thread that finishes its PHP request and goes back into the
>   select loop might find another incoming request and read it
>   before doing any writes.  And if that new request is another
>   long-running PHP request, it could be a while before we finally
>   get around to doing the write.

Hmm, yeah, this is a concern. One answer is to set a very high
MaxThreadLimit, but then you can't control how many PHP threads you
have. Another answer is to reserve some threads for I/O, which your
design does.

>   It's possible to partly work around this by implementing
>   get_next_event() so that it completes all pending, unblocked
>   writes before returning.  But more generally, we'll need some
>   solution to keep long-running, non-event-based requests from
>   taking over all the server threads.  (This is true of my design
>   as well.)

Actually, in your design, since you have separate threads for I/O, I
don't see why it would suffer.

Re: Another async I/O proposal [was Re: request for comments: multiple-connections-per-thread MPM design]

Posted by Brian Pane <br...@cnet.com>.
On Mon, 2002-11-25 at 00:02, Manoj Kasichainula wrote:
> I have some suggestions for Brian's design proposal which I'm pondering
> and writing up in another message, but meanwhile, I have an alternate
> proposal that I've been rolling around inside my head for months now, so
> I figured I might as well write it up.
> 
> It involves (mostly) a single pool of threads all running through an
> event loop. I think the below could be written as a single MPM for a
> specific operating system, or a generic MPM optimized for many OSes, or
> just APR.
> 
> It is also a hybrid sync/async approach, but most aspects of the approach
> can be handled by a single thread pool instead of multiple.
> 
> Please punch holes in this proposal at will.

In general, I like this design.  It provides a simple solution
for mixing event-driven and non-event-driven modules in the same
server.  I see a few problems, though, as detailed below:

> Definitions
> -----------
> 
> Ticket - something to do, e.g. [READ, fd], [LISTEN, fd], [WRITE, fd,
> buckets]. It's a request for the main event loop to give us back an
> event.
> 
> Event - something that has been done (with some of the data used in it)
> and its result, e.g. [READ, buckets], [LISTEN, fd], [WRITE], etc.
> 
> Both of the above include contexts for state maintenance of course.
> 
> Event processor - receives events, processes them, decides on
> consequences, and returns a new ticket to handle, or NULL if there is
> none
> 
> 
> Design
> ------
> 
> We have a single pool of threads, growing and shrinking as needed, in a
> standard event-handling loop:
> 
> while (event = get_next_event())
>    add more spare threads if needed
>    event_processor = lookup_event_processor(event)
>    ticket = event_processor(event)
>    if (ticket) submit_ticket(ticket)
>    exit loop (and thus end thread) if not needed
> 
> The event_processor can take as long as it wants, since there are other
> threads who can wait for the next event.

Where is the locking done?  Is the lock just around the
get_next_event() call?

> Tickets could be handled in multiple disjoint iterations of the event
> loop, but the event processors never see this. This is how Windows can
> process a WRITE ticket for a file bucket with TransmitFile w/ completion
> ports, Linux can (IIRC) use a non-blocking sendfile loop, and an
> old-school unix can use a read-write loop. Note that I did mention
> platform-specific code; does APR know how to do async and nonblocking
> I/O for various platforms in the optimal way? If not, this loop could.

APR handles much of the work: it provides a sendfile API, for example,
that's ifdef'ed to do sendfilev on Solaris, sendfile on Linux, and
mmap+writev on older platforms.  Based on our experiences with the
core_output_filter in 2.0, though, I expect that get_next_event()
will still have to do some platform-specific processing so that it
knows when to cork/un-cork the connection on Linux, for example.
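
To make the corking concrete: on Linux it reduces to a setsockopt
pair (shown raw here; in the server it would go through the APR
socket-option interface rather than being called directly):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /* hold small header writes back until the body is queued too */
    static void set_cork(int fd, int on)
    {
        setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
    }

    /* usage: set_cork(fd, 1); write headers + body; set_cork(fd, 0); */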

> submit_ticket and get_next_event work together to provide the smarts of
> the loop. On old-school unix, submit_ticket would take a ticket and set
> up the fd_set, and get_next_event would select() on the fd_set and do
> what's appropriate, which doesn't always involve a quick system call and
> a return of an event. For example, while handling a WRITE ticket, we
> might only be able to partially complete the write without blocking. In
> that case, get_next_event could rejigger the fd_set and go back to the
> select() call.
> 
> HTTP's event_processors, in a simple case where all handlers read HTTP
> request data, process it, then return, look sort of like:
> 
> http_listen_processor = http_request_processor
>    
> http_request_processor(event)
>     input_buckets += get_buckets(event)
>     if (need_more_for_this_request)
>         return new_read_ticket(fd, http_request_processor, context)
>     else
>         /* Next line can take a long time and can be written in a
>          * blocking fashion */
>         output_buckets = request_handler(fd, input_buckets)
>         return new_write_ticket(fd, output_buckets,
>                                 http_keepalive_processor, context)


Once the httpd_request_processor() has created a new ticket for
the write, how does the submit_ticket() get the socket added into
the pollset?  If it's possible for another request to be in the
middle of a poll call at the same time, does submit_ticket()
interrupt the poll in order to add the new descriptor?


> http_keepalive_processor(event)
>     if (keepalive)
>         return NULL
>     else
>         return new_read_ticket(fd, http_request_processor, context)
> 
> If we want to allow it, the request_handler() call above could even do
> its own reading and writing of the file descriptor.
> 
> In the single process case on old-school Unix, submit_ticket can just
> tell get_next_event to select+accept w/ a simple mutex around them.  In
> the multiple process case, it can wait on a queue for an outside
> listener thread like in Brian's description. And in some Unixes (and I
> believe Windows with completion ports), the multiprocess case isn't a
> concern. Linux 2.6 could use epoll and avoid all these issues, and 2.4
> has a realtime signal interface to do the same thing I believe.
> 
> I've glossed over where the conn_recs and request_recs get built.
> That's mainly because I don't know how the multi-protocol stuff deals
> with request_recs :). I would expect conn_recs to be completely generic,
> and request_recs to be somewhat or completely http-specific. Generic
> portions could go into the main event loop, HTTP portions go into the
> http event processors.
> 
> Disadvantages of this proposal I can think of offhand:
> 
> - Because threads are mostly in one large pool, some common structures
>   have to be protected through a mutex. I like paying for mutexes more
>   than paying for context switches though.
> 
> - We're creating and destroying a lot of "objects" (tickets and events).
>   I don't think there'll be much overhead since these aren't real OO
>   objects, but we have to be careful

One more disadvantage:

- Flow control will be difficult.  Here's a tricky scenario I
  just thought of:  The server is configured to run 10 threads.
  Most of the time, it only needs a couple of them, because it's
  serving mostly static content and an occasional PHP request.
  Suddenly, it gets a flood of requests for PHP pages.  The first
  ten of these quickly take over the ten available threads.  As
  PHP doesn't know how to work in an event-driven world, each
  of these requests holds onto its thread for a long time.  When
  one of them finally completes, it produces some content to be
  written.  But the writes may be starved, because the first
  thread that finishes its PHP request and goes back into the
  select loop might find another incoming request and read it
  before doing any writes.  And if that new request is another
  long-running PHP request, it could be a while before we finally
  get around to doing the write.

  It's possible to partly work around this by implementing
  get_next_event() so that it completes all pending, unblocked
  writes before returning.  But more generally, we'll need some
  solution to keep long-running, non-event-based requests from
  taking over all the server threads.  (This is true of my design
  as well.)


Brian



Another async I/O proposal [was Re: request for comments: multiple-connections-per-thread MPM design]

Posted by Manoj Kasichainula <ma...@io.com>.
I have some suggestions for Brian's design proposal which I'm pondering
and writing up in another message, but meanwhile, I have an alternate
proposal that I've been rolling around inside my head for months now, so
I figured I might as well write it up.

It involves (mostly) a single pool of threads all running through an
event loop. I think the below could be written as a single MPM for a
specific operating system, or a generic MPM optimized for many OSes, or
just APR.

It is also a hybrid sync/async approach, but most aspects of the approach
can be handled by a single thread pool instead of multiple.

Please punch holes in this proposal at will.

Definitions
-----------

Ticket - something to do, e.g. [READ, fd], [LISTEN, fd], [WRITE, fd,
buckets]. It's a request for the main event loop to give us back an
event.

Event - something that has been done (with some of the data used in it)
and its result, e.g. [READ, buckets], [LISTEN, fd], [WRITE], etc.

Both of the above include contexts for state maintenance of course.

Event processor - receives events, processes them, decides on
consequences, and returns a new ticket to handle, or NULL if there is
none


Design
------

We have a single pool of threads, growing and shrinking as needed, in a
standard event-handling loop:

while (event = get_next_event())
   add more spare threads if needed
   event_processor = lookup_event_processor(event)
   ticket = event_processor(event)
   if (ticket) submit_ticket(ticket)
   exit loop (and thus end thread) if not needed

The event_processor can take as long as it wants, since there are other
threads who can wait for the next event.

Tickets could be handled in multiple disjoint iterations of the event
loop, but the event processors never see this. This is how Windows can
process a WRITE ticket for a file bucket with TransmitFile w/ completion
ports, Linux can (IIRC) use a non-blocking sendfile loop, and an
old-school unix can use a read-write loop. Note that I did mention
platform-specific code; does APR know how to do async and nonblocking
I/O for various platforms in the optimal way? If not, this loop could.

submit_ticket and get_next_event work together to provide the smarts of
the loop. On old-school unix, submit_ticket would take a ticket and set
up the fd_set, and get_next_event would select() on the fd_set and do
what's appropriate, which doesn't always involve a quick system call and
a return of an event. For example, while handling a WRITE ticket, we
might only be able to partially complete the write without blocking. In
that case, get_next_event could rejigger the fd_set and go back to the
select() call.
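
A stripped-down version of that select() backend might look like
this (single-threaded for clarity; the locking discussed elsewhere
in the thread still has to wrap it):

    #include <sys/select.h>

    enum ticket_kind { TICKET_READ, TICKET_WRITE };

    static fd_set read_fds, write_fds;
    static int max_fd;

    static void submit_ticket(enum ticket_kind kind, int fd)
    {
        FD_SET(fd, kind == TICKET_READ ? &read_fds : &write_fds);
        if (fd > max_fd)
            max_fd = fd;
    }

    /* Returns one ready fd.  A partially-completed write just stays
     * in write_fds, so the next select() picks it up again. */
    static int get_next_event(enum ticket_kind *kind_out)
    {
        for (;;) {
            fd_set r = read_fds, w = write_fds;
            int fd;
            if (select(max_fd + 1, &r, &w, NULL, NULL) <= 0)
                continue;                    /* EINTR and friends */
            for (fd = 0; fd <= max_fd; fd++) {
                if (FD_ISSET(fd, &r)) {
                    FD_CLR(fd, &read_fds);
                    *kind_out = TICKET_READ;
                    return fd;
                }
                if (FD_ISSET(fd, &w)) {
                    FD_CLR(fd, &write_fds);
                    *kind_out = TICKET_WRITE;
                    return fd;
                }
            }
        }
    }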

HTTP's event_processors, in a simple case where all handlers read HTTP
request data, process it, then return, look sort of like:

http_listen_processor = http_request_processor
   
http_request_processor(event)
    input_buckets += get_buckets(event)
    if (need_more_for_this_request)
        return new_read_ticket(fd, http_request_processor, context)
    else
        /* Next line can take a long time and can be written in a
         * blocking fashion */
        output_buckets = request_handler(fd, input_buckets)
        return new_write_ticket(fd, output_buckets,
                                http_keepalive_processor, context)

http_keepalive_processor(event)
    if (keepalive)
        return NULL
    else
        return new_read_ticket(fd, http_request_processor, context)

If we want to allow it, the request_handler() call above could even do
its own reading and writing of the file descriptor.

In the single process case on old-school Unix, submit_ticket can just
tell get_next_event to select+accept w/ a simple mutex around them.  In
the multiple process case, it can wait on a queue for an outside
listener thread like in Brian's description. And in some Unixes (and I
believe Windows with completion ports), the multiprocess case isn't a
concern. Linux 2.6 could use epoll and avoid all these issues, and 2.4
has a realtime signal interface to do the same thing I believe.

I've glossed over where the conn_recs and request_recs get built.
That's mainly because I don't know how the multi-protocol stuff deals
with request_recs :). I would expect conn_recs to be completely generic,
and request_recs to be somewhat or completely http-specific. Generic
portions could go into the main event loop, HTTP portions go into the
http event processors.

Disadvantages of this proposal I can think of offhand:

- Because threads are mostly in one large pool, some common structures
  have to be protected through a mutex. I like paying for mutexes more
  than paying for context switches though.

- We're creating and destroying a lot of "objects" (tickets and events).
  I don't think there'll be much overhead since these aren't real OO
  objects, but we have to be careful

Advantages:

- Async I/O, introduced gradually throughout the server. At first, this
  can just be yet another MPM, with no change to the rest of the server.
  But eventually, it could allow both completely event-driven and
  completely synchronous protocol handlers.  The event-driven protocol
  handlers can then allow event-driven user modules if they choose, or
  run user modules synchronously, or some combination of the 2. A server
  filled only with event-driven protocols and event-driven modules can
  run with almost as few as one thread per CPU, with no other tweaking.

- There's no bottleneck where a single thread might block unexpectedly
  and hold up the rest of the process, unless we're forced to put a
  mutex around a suspect system call. I don't think there is in Brian's
  design either, but I haven't thought it through completely :)
  
- The framework can be reused by different operating systems, each
  optimizing as much or as little as they see fit, or all wrapped in APR
  if we choose. submit_ticket and get_next_event should be the only
  calls that need to be replaced.

- Minimized context switches. If get_next_event is crafted
  appropriately, we could even have thread affinity for connections,
  meaning that if there's only one connection coming in at a time, only
  one thread ever runs

- Transparent support for multiple CPUs

Re: request for comments: multiple-connections-per-thread MPM design

Posted by Glenn <gs...@gluelogic.com>.
On Thu, Dec 12, 2002 at 12:39:17AM -0800, Manoj Kasichainula wrote:
...
> > Add a descriptor (pipe, socket, whatever) to the pollset and use
> > it to indicate the need to generate a new pollset.  The thread that sends
> > info down this descriptor could be programmed to wait a short amount of
> > time between sending triggers, so as not to cause the select() to return
> > too, too often, but short enough not to delay the handling of new
> > connections too long.
> 
> But what's a good value?
...
> Hmmm, if the poll is waiting on fds for any length of time, it should be
> ok to interrupt it, because by definition it's not doing anything else.
> 
> So maybe the way to go is to forget about waiting the 0.1 s to interrupt
> poll. Just notify it immediately when there's a fd waiting to be polled.
> If no other fds have work to provide, we add the new fds to the poll set
> and continue.
> Otherwise, just run through all the other fds that need handling first,
> then pick off all the fds that are waiting for polling and add them to
> the fd set.
> 
> So (again using terms from my proposal):
> 
> submit_ticket would push fds into a queue and write to new_event_pipe if
> the queue was empty when pushing.
> 
> get_next_event would do something like:
> 
> if (previous_poll_fds_remaining) {
>     pick one off, call event handler for it
> }
> else {
>     clean out new_event_queue and put values into new poll set
>     poll(pollfds, io_timeout);
>     call event handler for one of the returned pollfds
> }
...

+1 on concept with comments:
Each time poll returns to handle ready fds, it should skip new_event_pipe
(it should not send that fd to an event handler), and it should check
new_event_queue for fds to add to the pollset before it returns to polling.

It should always be doing useful work or should be blocking in select(),
because it will always have at least one fd -- its end of new_event_pipe --
in its pollset.


Coding to interrupt the poll immediately is the first thing to do, and
then a max short delay can be added to submit_ticket only if necessary.

As you said, the max short delay would only affect the unbusy case where
the poll is waiting on all current members of the pollset.  The short
delay had been suggested to prevent interrupting select() before select()
had a chance to do any useful work.  We won't know if this is a real or
imagined problem until it is tested.  It sounds like it won't be a
performance problem, although using the max short timer of even 0.05s might
slightly reduce the CPU usage of these threads when under heavy load.

-Glenn

Re: request for comments: multiple-connections-per-thread MPM design

Posted by Manoj Kasichainula <ma...@io.com>.
Took too long to respond. Oh well, no one else did either...

On Tue, Nov 26, 2002 at 01:14:10AM -0500, Glenn wrote:
> On Mon, Nov 25, 2002 at 08:36:59PM -0800, Manoj Kasichainula wrote:
> > BTW, ISTR Ryan commenting a while back that cross-thread signalling
> > isn't reliable, and it scares me in general, so I'd lean towards the
> > pipe.
> > 
> > I'm pondering what else could be done about this; having to muck with a
> > pipe doesn't feel like the right thing to do.
> 
> Why not?

Good question. I'm still waffling on this.

> Add a descriptor (pipe, socket, whatever) to the pollset and use
> it to indicate the need to generate a new pollset.  The thread that sends
> info down this descriptor could be programmed to wait a short amount of
> time between sending triggers, so as not to cause the select() to return
> too, too often, but short enough not to delay the handling of new
> connections too long.

But what's a good value? Any value picked is going to be too annoying.
0.1 s means delaying lots of threads up to a tenth of a second. And
there would be good reasons for wanting to lower that value, and to not
lower that value. Which would mean it would need to be a tunable
parameter depending on network and CPU characteristics, and needing a
tunable parameter for this just seems ooky. 

But just picking a good value and sticking with it might not be too bad.
The correct thing to do would be to code it up and test, but I'd rather
have a reasonable idea of the chances for success first. :)

In the perfect case, each poll call would return immediately with lots
of file descriptors ready for work, and they would all get farmed out.
Then before the next poll runs, there are more file descriptors ready to
be polled. 

Hmmm, if the poll is waiting on fds for any length of time, it should be
ok to interrupt it, because by definition it's not doing anything else.

So maybe the way to go is to forget about waiting the 0.1 s to interrupt
poll. Just notify it immediately when there's a fd waiting to be polled.
If no other fds have work to provide, we add the new fds to the poll set
and continue.

Otherwise, just run through all the other fds that need handling first,
then pick off all the fds that are waiting for polling and add them to
the fd set.

So (again using terms from my proposal):

submit_ticket would push fds into a queue and write to new_event_pipe if
the queue was empty when pushing.

get_next_event would do something like:

if (previous_poll_fds_remaining) {
    pick one off, call event handler for it
}
else {
    clean out new_event_queue and put values into new poll set
    poll(pollfds, io_timeout);
    call event handler for one of the returned pollfds
}
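
Filled out a little (names are illustrative; slot 0 of the pollset
is new_event_pipe's read end and is never handed to an event
handler, and for brevity this re-polls instead of remembering the
leftover ready fds):

    #include <poll.h>
    #include <unistd.h>

    #define MAX_FDS 1024

    static struct pollfd pollfds[MAX_FDS];  /* [0] = new_event_pipe */
    static int npollfds = 1;

    static int next_ready_fd(int io_timeout)
    {
        for (;;) {
            int i;
            if (poll(pollfds, npollfds, io_timeout) <= 0)
                continue;

            /* drain the notification pipe and pull waiting fds off
             * new_event_queue into the pollset before anything else */
            if (pollfds[0].revents & POLLIN) {
                char buf[64];
                read(pollfds[0].fd, buf, sizeof(buf));
                /* ...append queued fds to pollfds[], bump npollfds... */
            }
            for (i = 1; i < npollfds; i++)
                if (pollfds[i].revents & (POLLIN | POLLOUT))
                    return pollfds[i].fd;   /* caller runs its handler */
        }
    }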

Something was bothering me about this earlier, and I can't remember what
it is. Maybe it's that when the server isn't busy, a single ticket
submission will make 2 threads (the ticket submitter and the thread
holding the poll mutex) do stuff. Maybe even 3 threads since a new
thread could take the poll mutex. But since this is the unbusy case,
it's not quite so bad.


Re: request for comments: multiple-connections-per-thread MPM design

Posted by Glenn <gs...@gluelogic.com>.
On Mon, Nov 25, 2002 at 08:36:59PM -0800, Manoj Kasichainula wrote:
> On Mon, Nov 25, 2002 at 07:12:43AM -0800, Brian Pane wrote:
> > The real reason I don't like the mutex around the poll is that
> > it would add too much latency if we had to wait for the current
> > poll to complete before adding a new descriptor.  When the
> > Listener accepts a new connection, or a Request Processor creates
> > a new response brigade, it needs to get the corresponding socket
> > added to the pollset immediately, which really requires interrupting
> > the current poll.
> 
> Hmmm. That's a problem that needs solving even without the mutex though
> (and it affects the design I proposed yesterday as well).  When you're
> adding a new fd to the reader or writer, you have to write to a pipe or
> send a signal. The mutex shouldn't affect that. 
> 
> BTW, ISTR Ryan commenting a while back that cross-thread signalling
> isn't reliable, and it scares me in general, so I'd lean towards the
> pipe.
> 
> I'm pondering what else could be done about this; having to muck with a
> pipe doesn't feel like the right thing to do.

Why not?  Add a descriptor (pipe, socket, whatever) to the pollset and use
it to indicate the need to generate a new pollset.  The thread that sends
info down this descriptor could be programmed to wait a short amount of
time between sending triggers, so as not to cause the select() to return
too, too often, but short enough not to delay the handling of new
connections too long.  And the select()er thread would need to add a quick
step to check for this special descriptor instead of treating them all as
external requests.  It would also need to somehow signal the other thread
each time select() returned so that waiting descriptors could be added
immediately.

Or am I smoking what Manoj is smoking?

-Glenn

Re: request for comments: multiple-connections-per-thread MPM design

Posted by Manoj Kasichainula <ma...@io.com>.
On Mon, Nov 25, 2002 at 08:36:59PM -0800, Me at IO wrote:
> I'm just guessing here, but I imagine most CPU effort wouldn't be
> expended in the actual kernel<->user transitions that are polls and
> non-blocking I/O.  And the meat of those operations could be handled by
> other CPUs at the kernel level. So that separation onto multiple
> CPUs might not help much.

Eh, I was on crack when I wrote this. You want an I/O thread per CPU
when you can get it.

Re: request for comments: multiple-connections-per-thread MPM design

Posted by Manoj Kasichainula <ma...@io.com>.
On Mon, Nov 25, 2002 at 07:12:43AM -0800, Brian Pane wrote:
> On Mon, 2002-11-25 at 00:20, Manoj Kasichainula wrote:
> > I was actually wondering why the reader and writer were seperate
> > threads.
> 
> It was a combination of several factors that convinced me
> to make them separate:
> * Take advantage of multiple CPUs more easily

Yeah, but as you noticed, once you get more than 2 CPUs, you have the
same problem.

I'm just guessing here, but I imagine most CPU effort wouldn't be
expended in the actual kernel<->user transitions that are polls and
non-blocking I/O.  And the meat of those operations could be handled by
other CPUs at the kernel level. So that separation onto multiple
CPUs might not help much.

> * Reduce the number of file descriptors that each poll call
>   is handling (important on platforms where we don't have
>   an efficient poll mechanism)

Has anyone read or benchmarked whether 2 threads polling 500 fds is
faster than 1 thread polling 1000?

> > For Linux 2.6, file notifications could be done entirely in userland in
> > the case where no blocking is needed, using "futexes".
> 
> Thanks!  I'll check out futexes.

Note that futexes are just fast userspace mutexes. Those are already in the
kernel (according to some threads I read yesterday anyway). But I
believe the part about file notification using them is still in
discussion.

> > But if you want to avoid the extra system calls, you could put a mutex
> around maintenance of the pollset and just let the various threads dork
> > with it directly.
> > 
> > I do keep mentioning this mutex around the select/poll :). Is there a
> > performance reason that you're trying to avoid it? In my past skimmings,
> > I've seen you post a lot of benchmarks and such, so maybe you've studied
> > this.
> 
> The real reason I don't like the mutex around the poll is that
> it would add too much latency if we had to wait for the current
> poll to complete before adding a new descriptor.  When the
> Listener accepts a new connection, or a Request Processor creates
> a new response brigade, it needs to get the corresponding socket
> added to the pollset immediately, which really requires interrupting
> the current poll.

Hmmm. That's a problem that needs solving even without the mutex though
(and it affects the design I proposed yesterday as well).  When you're
adding a new fd to the reader or writer, you have to write to a pipe or
send a signal. The mutex shouldn't affect that. 

BTW, ISTR Ryan commenting a while back that cross-thread signalling
isn't reliable, and it scares me in general, so I'd lean towards the
pipe.

I'm pondering what else could be done about this; having to muck with a
pipe doesn't feel like the right thing to do. Perhaps I should actually
look at other people's code to see what they do. Other designs have
threads for disk I/O and such, so there should be a way. I believe
Windows doesn't have this problem, or at least hides it better, because
completion ports are independent entities that don't interact with each
other as far as the user is concerned.


Re: request for comments: multiple-connections-per-thread MPM design

Posted by Brian Pane <br...@cnet.com>.
On Mon, 2002-11-25 at 00:20, Manoj Kasichainula wrote:
> On Sat, Nov 23, 2002 at 06:40:58PM -0800, Brian Pane wrote:
> > Here's an outline of my latest thinking on how to build a
> > multiple-connections-per-thread MPM for Apache 2.2.  I'm
> > eager to hear feedback from others who have been researching
> > this topic.
> 
> You prodded me into finally writing up a proposal that's been bouncing
around in my head for a while now. That was in a separate message, this
> will be suggestions for your proposal.
> 
> > 1. Listener thread
> >       A Listener thread accept(2)s a connection, creates
> >       a conn_rec for it, and sends it to the Reader thread.
> 
> Some (Most?) protocols have the server initiate the protocol
> negotatiation instead of the client, so the listener needs to be able to
> pass off to the writer thread as well.
> 
> > * Limiting the Reader and Writer pools to one thread each will
> >   simplify the design and implementation.  But will this impair
> >   our ability to take advantage of lots of CPUs?
> 
> I was actually wondering why the reader and writer were separate
> threads.

It was a combination of several factors that convinced me
to make them separate:
* Take advantage of multiple CPUs more easily
* Simplify the application logic
* Reduce the number of file descriptors that each poll call
  is handling (important on platforms where we don't have
  an efficient poll mechanism)

> What gets more complex with a thread pool > 1? I know we'd have to add a
> mutex around the select+(read|write), but is there something else?

If you split the pollset into 'n' sections and have 'n'
threads each handling reads or writes on one section of
it, it can be hard to balance the load.  Some threads
will end up with very active connections, while others
will have mostly idle connections.

The alternative is to have 'n' threads that take turns
handling the entire pollset.  That doesn't offer as much
concurrency, so I'm not sure if it's worth the extra
complexity.  But it would be easy to test.


> > * Can we eliminate the listener thread?  It would be faster to just
> >   have the Reader thread include the listen socket(s) in its pollset.
> >   But if we did that, we'd need some new way to synchronize the
> >   accept handling among multiple child processes, because we can't
> >   have the Reader thread blocking on an accept mutex when it has
> >   existing connections to watch.
> 
> You could dispense with the listener thread in the single-process case
> and just use an intraprocess mutex around select+(accept|read|write)

Right, with the accept/read/write all handled by the same
thread (or thread pool), the handoff problem goes away.

> > * Is there a more efficient way to interrupt a thread that's
> >   blocked in a poll call?  That's a crucial step in the Listener-to-
> >   Reader and Request Processor-to-Writer handoffs.  Writing a byte
> >   to a pipe requires two extra syscalls (a read and a write) per
> >   handoff.  Sending a signal to the target thread is the only
> >   other solution I can think of at the moment, but that's bad
> >   because the target thread might be in the middle of a read
> >   or write call, rather than a poll, at the moment when we hit
> >   it with a signal, so the read or write will fail with EINTR.
> 
> For Linux 2.6, file notifications could be done entirely in userland in
> the case where no blocking is needed, using "futexes".

Thanks!  I'll check out futexes.

> 
> But if you want to avoid the extra system calls, you could put a mutex
> around maintenance of the pollset and just let the various threads dork
> with it directly.
> 
> I do keep mentioning this mutex around the select/poll :). Is there a
> performance reason that you're trying to avoid it? In my past skimmings,
> I've seen you post a lot of benchmarks and such, so maybe you've studied
> this.

The real reason I don't like the mutex around the poll is that
it would add too much latency if we had to wait for the current
poll to complete before adding a new descriptor.  When the
Listener accepts a new connection, or a Request Processor creates
a new response brigade, it needs to get the corresponding socket
added to the pollset immediately, which really requires interrupting
the current poll.

Brian



Re: request for comments: multiple-connections-per-thread MPM design

Posted by Manoj Kasichainula <ma...@io.com>.
On Sat, Nov 23, 2002 at 06:40:58PM -0800, Brian Pane wrote:
> Here's an outline of my latest thinking on how to build a
> multiple-connections-per-thread MPM for Apache 2.2.  I'm
> eager to hear feedback from others who have been researching
> this topic.

You prodded me into finally writing up a proposal that's been bouncing
around in my head for a while now. That was in a separate message, this
will be suggestions for your proposal.

> 1. Listener thread
>       A Listener thread accept(2)s a connection, creates
>       a conn_rec for it, and sends it to the Reader thread.

Some (Most?) protocols have the server initiate the protocol
negotiation instead of the client, so the listener needs to be able to
pass off to the writer thread as well.

> * Limiting the Reader and Writer pools to one thread each will
>   simplify the design and implementation.  But will this impair
>   our ability to take advantage of lots of CPUs?

I was actually wondering why the reader and writer were separate
threads.

What gets more complex with a thread pool > 1? I know we'd have to add a
mutex around the select+(read|write), but is there something else?

> * Can we eliminate the listener thread?  It would be faster to just
>   have the Reader thread include the listen socket(s) in its pollset.
>   But if we did that, we'd need some new way to synchronize the
>   accept handling among multiple child processes, because we can't
>   have the Reader thread blocking on an accept mutex when it has
>   existing connections to watch.

You could dispense with the listener thread in the single-process case
and just use an intraprocess mutex around select+(accept|read|write)
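
i.e. roughly this, per worker thread (sketch only; error handling
and the fd bookkeeping are omitted):

    #include <pthread.h>
    #include <sys/select.h>
    #include <sys/socket.h>

    static pthread_mutex_t loop_mutex = PTHREAD_MUTEX_INITIALIZER;

    /* The mutex covers select() plus the immediate accept/read/write;
     * the slow request processing happens after the unlock, so some
     * other thread can be back inside select() meanwhile. */
    static void worker_iteration(int listen_fd, fd_set *watched, int max_fd)
    {
        fd_set ready;

        pthread_mutex_lock(&loop_mutex);
        ready = *watched;
        if (select(max_fd + 1, &ready, NULL, NULL, NULL) > 0) {
            if (FD_ISSET(listen_fd, &ready)) {
                int conn = accept(listen_fd, NULL, NULL);
                (void)conn;  /* ...add it to *watched... */
            }
            /* ...or do the nonblocking read/write for a ready conn... */
        }
        pthread_mutex_unlock(&loop_mutex);

        /* ...process whatever was read, outside the lock... */
    }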

> * Is there a more efficient way to interrupt a thread that's
>   blocked in a poll call?  That's a crucial step in the Listener-to-
>   Reader and Request Processor-to-Writer handoffs.  Writing a byte
>   to a pipe requires two extra syscalls (a read and a write) per
>   handoff.  Sending a signal to the target thread is the only
>   other solution I can think of at the moment, but that's bad
>   because the target thread might be in the middle of a read
>   or write call, rather than a poll, at the moment when we hit
>   it with a signal, so the read or write will fail with EINTR.

For Linux 2.6, file notifications could be done entirely in userland in
the case where no blocking is needed, using "futexes".

But if you want to avoid the extra system calls, you could put a mutex
around maintenance of the pollset and just let the various threads dork
with it directly.

I do keep mentioning this mutex around the select/poll :). Is there a
performance reason that you're trying to avoid it? In my past skimmings,
I've seen you post a lot of benchmarks and such, so maybe you've studied
this.

I'm suspicious of signals, but as long as they are tightly controlled
with sigprocmask or pthread_sigmask, I guess they aren't so bad.