Posted to dev@httpd.apache.org by Brian Pane <br...@apache.org> on 2005/10/10 08:50:39 UTC

async write completion prototype

With the batch of commits I did this weekend, the Event MPM in
the async-dev Subversion branch now does write completion
in a nonblocking manner.  Once an entire response has been
generated and passed to the output filter chain, the MPM's
poller/listener thread watches the connection for writability
events.  When the connection becomes writable, the poller
thread sends it to one of the worker threads, which writes
some more output.

At this point, the event-handling code is ready for testing and
review by other developers.

The main changes on the async-dev branch (compared
to the 2.3-dev trunk) are:

1. ap_core_output_filter: rewrite to do nonblocking writes
    whenever possible.

2. core, http module, and mod_logio: removed the generation
    of flush buckets where possible.

3. request cleanup and logging: the logger phase and
    subsequent destruction of the request's pool are now
    triggered by the destruction of an End-Of-Request
    bucket in the core output filter.

4. event MPM: asynchronous handling of CONN_STATE_WRITE_COMPLETION.

There are several more things that need to be fixed in order
to make the asynchronous write completion useful in a
production release of httpd-2.x:

- The main pollset in the Event MPM currently is sized to
   hold up to one socket descriptor per worker thread.  With
   asynchronous keepalives and write completion, the pollset
   should accommodate many descriptors per thread.

- The logic for starting more child processes, which Event
   inherited from Worker, is based on assumptions about
   the number of concurrent connections being equal to
   the number of threads.  These assumptions aren't valid
   for a multiple-connections-per-thread MPM.

- Similarly, there may be some changes needed in the
   flow control logic that the listener thread uses to decide
   whether it can do an accept.

- The scoreboard probably needs a redesign.

- It may be valuable to have a separate thread pool to
   run handlers that do arbitrarily lengthy processing, such
   as mod_perl and mod_php.

Brian

Re: async write completion prototype

Posted by Phillip Susi <ps...@cfl.rr.com>.
Non-blocking is not async IO.  It is not really possible to perform 
zero-copy IO with non-blocking IO semantics; you must have full async IO 
to issue multiple pending requests.

Brian Akins wrote:
> Phillip Susi wrote:
> 
>> As an alternative, you can bypass the cache and do direct async IO to 
>> the disk with zero copies.  IIRC, this is supported on linux with the 
>> O_DIRECT flag.  Doing this though, means that you will need to handle 
>> caching yourself, which might not be such a good idea.  Does Linux not 
>> support O_DIRECT on sockets?
> 
> 
> Can you not just set the socket to non-blocking using O_NONBLOCK?
> 


Re: async write completion prototype

Posted by Brian Akins <br...@turner.com>.
Phillip Susi wrote:

> As an alternative, you can bypass the cache and do direct async IO to 
> the disk with zero copies.  IIRC, this is supported on linux with the 
> O_DIRECT flag.  Doing this though, means that you will need to handle 
> caching yourself, which might not be such a good idea.  Does Linux not 
> support O_DIRECT on sockets?

Can you not just set the socket to non-blocking using O_NONBLOCK?

-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies

Re: async write completion prototype

Posted by Phillip Susi <ps...@cfl.rr.com>.
On NT you can set the kernel buffer size on the socket to 0 (with 
setsockopt() or ioctlsocket()?), and the NIC can DMA directly from the 
user buffers to send rather than copy to kernel space.  This, of course, 
requires that you keep more than one pending async operation so the NIC 
always has a buffer available and can keep the line saturated.  If 
you memory-map the source file from the disk, then zero-copy IO can be 
done entirely from user space.  Is this optimization not available on 
Linux or FreeBSD?

As an alternative, you can bypass the cache and do direct async IO to 
the disk with zero copies.  IIRC, this is supported on linux with the 
O_DIRECT flag.  Doing this though, means that you will need to handle 
caching yourself, which might not be such a good idea.  Does Linux not 
support O_DIRECT on sockets?

By using this technique I have been able to achieve TransmitFile() 
performance levels entirely from user space, without any of the 
drawbacks of TransmitFile().  Specifically, virtually zero CPU time is 
needed to saturate multiple 100 Mbps links, pushing 11,820 KB/s; 
progress is known the entire time, the transfer can be canceled at any 
time, and a small handful of threads can service thousands of clients.

Paul Querna wrote:
> Phillip Susi wrote:
> 
>> On what OS?  Linux?  NT supports async IO on sockets rather nicely, as 
>> does FreeBSD iirc.
> 
> 
> The event MPM doesn't run on NT at all, only Unixes.
> 
> Yes, FreeBSD (and Linux) support aio_write().  But this requires that 
> you read the file off of disk, and into a buffer, and then copy it back 
> into the kernel.  With a non-blocking sendfile, we can avoid all the 
> data copying (aka 'zero copy'), and let the kernel do everything itself.
> 
> There is currently no such thing as 'async sendfile', which would be the 
> perfect solution for this use case.  There have been various people 
> mentioning it as an idea, but no one has gone out and done it.
> 
> 
> -Paul
> 


Re: async write completion prototype

Posted by Paul Querna <ch...@force-elite.com>.
Phillip Susi wrote:
> On what OS?  Linux?  NT supports async IO on sockets rather nicely, as 
> does FreeBSD iirc.

The event MPM doesn't run on NT at all, only Unixes.

Yes, FreeBSD (and Linux) support aio_write().  But this requires that 
you read the file off disk into a buffer, and then copy it back 
into the kernel.  With a non-blocking sendfile, we can avoid all the 
data copying (aka 'zero copy'), and let the kernel do everything itself.

There is currently no such thing as 'async sendfile', which would be the 
perfect solution for this use case.  There have been various people 
mentioning it as an idea, but no one has gone out and done it.


-Paul

Re: async write completion prototype

Posted by Phillip Susi <ps...@cfl.rr.com>.
On what OS?  Linux?  NT supports async IO on sockets rather nicely, as 
does FreeBSD iirc.

Paul Querna wrote:
> Phillip Susi wrote:
> 
>> Nicely done.  Have you done any benchmarking to see if this improved 
>> performance as one would expect?  Would it be much more work to use 
>> true async IO instead of non-blocking IO and polling?  What about 
>> doing the same for reads, as well as writes?
>>
> 
> All current async I/O methods require you to read the data off disk, i.e. 
> there is no async_sendfile()....
> 
> Reads are much harder in the current httpd.  There are several core 
> functions that would need to be rewritten first.
> 
> -Paul
> 


Re: async write completion prototype

Posted by Paul Querna <ch...@force-elite.com>.
Phillip Susi wrote:
> Nicely done.  Have you done any benchmarking to see if this improved 
> performance as one would expect?  Would it be much more work to use true 
> async IO instead of non-blocking IO and polling?  What about doing the 
> same for reads, as well as writes?
> 

All current async I/O methods require you to read the data off disk, i.e. 
there is no async_sendfile()....

Reads are much harder in the current httpd.  There are several core 
functions that would need to be rewritten first.

-Paul

Re: async write completion prototype

Posted by Phillip Susi <ps...@cfl.rr.com>.
Nicely done.  Have you done any benchmarking to see if this improved 
performance as one would expect?  Would it be much more work to use true 
async IO instead of non-blocking IO and polling?  What about doing the 
same for reads, as well as writes?

Brian Pane wrote:
> With the batch of commits I did this weekend, the Event MPM in
> the async-dev Subversion branch now does write completion
> in a nonblocking manner.  Once an entire response has been
> generated and passed to the output filter chain, the MPM's
> poller/listener thread watches the connection for writability
> events.  When the connection becomes writable, the poller
> thread sends it to one of the worker threads, which writes
> some more output.
> 
> At this point, the event-handling code is ready for testing and
> review by other developers.
> 
> The main changes on the async-dev branch (compared
> to the 2.3-dev trunk) are:
> 
> 1. ap_core_output_filter: rewrite to do nonblocking writes
>    whenever possible.
> 
> 2. core, http module, and mod_logio: removed the generation
>    of flush buckets where possible.
> 
> 3. request cleanup and logging: the logger phase and
>    subsequent destruction of the request's pool are now
>    triggered by the destruction of an End-Of-Request
>    bucket in the core output filter.
> 
> 4. event MPM: asynchronous handling of CONN_STATE_WRITE_COMPLETION.
> 
> There are several more things that need to be fixed in order
> to make the asynchronous write completion useful in a
> production release of httpd-2.x:
> 
> - The main pollset in the Event MPM currently is sized to
>   hold up to one socket descriptor per worker thread.  With
>   asynchronous keepalives and write completion, the pollset
>   should accommodate many descriptors per thread.
> 
> - The logic for starting more child processes, which Event
>   inherited from Worker, is based on assumptions about
>   the number of concurrent connections being equal to
>   the number of threads.  These assumptions aren't valid
>   for a multiple-connections-per-thread MPM.
> 
> - Similarly, there may be some changes needed in the
>   flow control logic that the listener thread uses to decide
>   whether it can do an accept.
> 
> - The scoreboard probably needs a redesign.
> 
> - It may be valuable to have a separate thread pool to
>   run handlers that do arbitrarily lengthy processing, such
>   as mod_perl and mod_php.
> 
> Brian
> 


Re: async write completion prototype

Posted by Greg Ames <gr...@apache.org>.
Greg Ames wrote:

> this is interesting to me because Brian Atkins recently reported that 

s/Atkins/Akins/   sorry, Brian

Greg

Re: async write completion prototype

Posted by Brian Akins <br...@turner.com>.
Greg Ames wrote:

> do you recall if CPU cycles were maxed out in both cases?

Yes.

-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies

Re: async write completion prototype

Posted by Greg Ames <gr...@apache.org>.
Brian Akins wrote:

> Basically, I was referring to the overall hits a box could serve per 
> second.
> 
> with 512 concurrent connections and about an 8k file, 2.1 with worker 
> served about 22k requests/second.  Event served about 14k.

do you recall if CPU cycles were maxed out in both cases?

thanks,
Greg

Re: async write completion prototype

Posted by Paul Querna <ch...@force-elite.com>.
Brian Pane wrote:
> On Oct 18, 2005, at 7:11 AM, Greg Ames wrote:
> 
>> Brian Pane wrote:
>>
>>> I think one contributor to the event results is an issue that Paul  
>>> Querna
>>> pointed out on #httpd-dev the other day: apr_pollset_remove runs in O(n)
>>> time with n descriptors in the pollset.
>>>
>>
>> thanks, I see it.  yeah we are going to have to do something about that.
> 
> I just committed a change to the epoll version that eliminates the
> O(n) loop--and the mutex operations and a bit of data structure
> copying.

Awesome! I really like it, a very nice addition to apr_pollset. I will 
try to update APR with KQueue support on Sunday.

> The version of the Event MPM on the async-dev branch takes
> advantage of this new feature.  I'm seeing a ~5% increase in
> throughput in a simple test setup (http_load on a client machine
> driving ~200 concurrent connections over 1Gb/s ethernet to
> Apache running on Linux 2.6).
> 
> If anybody with a more industrial-strength load testing setup
> can try the async-dev version of the Event MPM with a few
> thousand concurrent connections, I'm eager to hear whether
> this new epoll code yields a useful speedup.

I agree; I have had problems in the past telling if any changes to the 
Event MPM have good or bad performance implications.  Some of it is best 
guess, but it really would be nice to have a semi-reliable way to 
benchmark it that included keep-alive connections.

-Paul




Re: async write completion prototype

Posted by Brian Pane <br...@apache.org>.
On Oct 18, 2005, at 7:11 AM, Greg Ames wrote:

> Brian Pane wrote:
>
>> I think one contributor to the event results is an issue that  
>> Paul  Querna
>> pointed out on #httpd-dev the other day: apr_pollset_remove runs  
>> in O(n)
>> time with n descriptors in the pollset.
>>
>
> thanks, I see it.  yeah we are going to have to do something about  
> that.

I just committed a change to the epoll version that eliminates the
O(n) loop--and the mutex operations and a bit of data structure
copying.
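
For context, the epoll interface itself adds and removes descriptors one at a time without scanning the set, which is what makes an O(1) remove possible. A minimal sketch of the underlying syscalls (my own helper names, not the apr_pollset code):

```c
#include <sys/epoll.h>

/* Adding and removing one descriptor: each is a single epoll_ctl()
 * call, independent of how many descriptors the set holds. */
int watch_fd(int epfd, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

int unwatch_fd(int epfd, int fd)
{
    /* The event argument is ignored for EPOLL_CTL_DEL on kernels
     * after 2.6.9, but must still be non-NULL on older ones. */
    struct epoll_event ev = { 0 };
    return epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &ev);
}
```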

The version of the Event MPM on the async-dev branch takes
advantage of this new feature.  I'm seeing a ~5% increase in
throughput in a simple test setup (http_load on a client machine
driving ~200 concurrent connections over 1Gb/s ethernet to
Apache running on Linux 2.6).

If anybody with a more industrial-strength load testing setup
can try the async-dev version of the Event MPM with a few
thousand concurrent connections, I'm eager to hear whether
this new epoll code yields a useful speedup.

Thanks,
Brian


Re: async write completion prototype

Posted by Greg Ames <gr...@apache.org>.
Brian Pane wrote:
> I think one contributor to the event results is an issue that Paul  Querna
> pointed out on #httpd-dev the other day: apr_pollset_remove runs in O(n)
> time with n descriptors in the pollset.

thanks, I see it.  yeah we are going to have to do something about that.

Greg

Re: async write completion prototype

Posted by Brian Pane <br...@apache.org>.
I think one contributor to the event results is an issue that Paul Querna
pointed out on #httpd-dev the other day: apr_pollset_remove runs in O(n)
time with n descriptors in the pollset.

Brian

On Oct 13, 2005, at 11:36 AM, Brian Akins wrote:

> Greg Ames wrote:
>
>
>> this is interesting to me because Brian Atkins recently reported  
>> that the event MPM was much slower.
>> http://mail-archives.apache.org/mod_mbox/httpd-dev/200509.mbox/%3c43219161.3030102@web.turner.com%3e
>>
>
> No "t" in my last name :)
>
>
> Basically, I was referring to the overall hits a box could serve  
> per second.
>
> with 512 concurrent connections and about an 8k file, 2.1 with  
> worker served about 22k requests/second.  Event served about 14k.
>
> It's been a while since I did the test, and I'm too busy for the  
> next few days to re-run them.
>
>
> -- 
> Brian Akins
> Lead Systems Engineer
> CNN Internet Technologies
>


Re: async write completion prototype

Posted by Brian Akins <br...@turner.com>.
Greg Ames wrote:

> this is interesting to me because Brian Atkins recently reported that 
> the event MPM was much slower. 
> http://mail-archives.apache.org/mod_mbox/httpd-dev/200509.mbox/%3c43219161.3030102@web.turner.com%3e 

No "t" in my last name :)


Basically, I was referring to the overall hits a box could serve per second.

with 512 concurrent connections and about an 8k file, 2.1 with worker 
served about 22k requests/second.  Event served about 14k.

It's been a while since I did the test, and I'm too busy for the next 
few days to re-run them.


-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies

Re: async write completion prototype

Posted by Greg Ames <gr...@apache.org>.
Brian Pane wrote:
> On Oct 10, 2005, at 12:01 AM, Paul Querna wrote:

>> If the content has already been generated, why add the overhead of  
>> the context switch/sending to another thread?  Can't the same event  
>> thread do a non-blocking write?
>>
>> Once it finishes writing, then yes, we do require a context-switch  to 
>> another thread to do logging/cleanup.
>>
>> I am mostly thinking about downloading a 1 gig file with the  current 
>> pattern against a slow client.  A non-blocking write might  only do 
>> ~64k at a time, causing 1 gig/64k context switches, which seems 
>> less than optimal.
> 
> 
> If I had to choose, I'd rather do the context switches than devote a
> thread (and the associated stack space) to the connection until
> the writes are finished--especially if the server is delivering a
> thousand 1GB files to slow clients concurrently.
> 
> However, it's probably possible to have _both_ a high ratio
> of connections to threads (for scalability) and a low ratio of
> context switches to megabytes delivered (for efficiency).
> The Event MPM currently has to do a lot of context switching
> because it detects events in one thread and processes them
> in another.  If we add async write completion to the
> Leader/Followers MPM (or incorporate a leader/follower
> thread model into Event), it should reduce the context
> switches considerably.

this is interesting to me because Brian Atkins recently reported that 
the event MPM was much slower. 
http://mail-archives.apache.org/mod_mbox/httpd-dev/200509.mbox/%3c43219161.3030102@web.turner.com%3e

it would be nice to hear more details, but I assume that this means 
event is burning more CPU for a given workload rather than some kind of 
extra latency bug.  we know that event has more context switching than 
worker when keepalives are in use but pipelining is not, and async write 
completion will add to it.  I suppose we should profile event and worker 
and compare profiles in case there's some other unexpected CPU burner 
out there.

if context switch overhead is really the culprit, how do we reduce it? 
if I recall correctly, leader/follower sort of plays tag and the next 
thread that's It gets to be the listener.  I can see that running the 
request processing on the same thread that does the accept would be more 
cache friendly, and it might save some of the current queuing logic. 
but doesn't this have about the same amount of pthread library/scheduler 
overhead to "tag" the new listener and dispatch it as we have now waking 
up worker threads?

another brainstorm is to use a short keepalive timeout, like 200ms*, on 
the worker thread.  if it pops, turn the connection over to the event 
pollset using the remaining KeepAliveTimeout and give up the worker 
thread.

Greg

*200ms - the idea is to use something just big enough to cover most 
network round trip times, so we catch the case where the browser sends 
the next request immediately after getting our response.
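
That two-stage wait could be sketched as below; the helper name and return convention are mine, for illustration of the idea rather than a proposed patch. The worker waits only the short interval for the next request; on timeout it would hand the connection to the event pollset for the rest of KeepAliveTimeout.

```c
#include <poll.h>

#define SHORT_KEEPALIVE_MS 200   /* just big enough to cover most RTTs */

/* Returns 1 if the next request arrived quickly (keep the worker
 * thread), 0 on timeout (turn the connection over to the poller),
 * -1 on error. */
int short_keepalive_wait(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc = poll(&pfd, 1, SHORT_KEEPALIVE_MS);

    if (rc < 0)
        return -1;
    return rc > 0 ? 1 : 0;
}
```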

Re: async write completion prototype

Posted by Brian Pane <br...@apache.org>.
On Oct 10, 2005, at 12:01 AM, Paul Querna wrote:

> Brian Pane wrote:
>
>> With the batch of commits I did this weekend, the Event MPM in
>> the async-dev Subversion branch now does write completion
>> in a nonblocking manner.  Once an entire response has been
>> generated and passed to the output filter chain, the MPM's
>> poller/listener thread watches the connection for writability
>> events.  When the connection becomes writable, the poller
>> thread sends it to one of the worker threads, which writes
>> some more output.
>>
>
> If the content has already been generated, why add the overhead of  
> the context switch/sending to another thread?  Can't the same event  
> thread do a non-blocking write?
>
> Once it finishes writing, then yes, we do require a context-switch  
> to another thread to do logging/cleanup.
>
> I am mostly thinking about downloading a 1 gig file with the  
> current pattern against a slow client.  A non-blocking write might  
> only do ~64k at a time, causing 1 gig/64k context switches, 
> which seems less than optimal.

If I had to choose, I'd rather do the context switches than devote a
thread (and the associated stack space) to the connection until
the writes are finished--especially if the server is delivering a
thousand 1GB files to slow clients concurrently.

However, it's probably possible to have _both_ a high ratio
of connections to threads (for scalability) and a low ratio of
context switches to megabytes delivered (for efficiency).
The Event MPM currently has to do a lot of context switching
because it detects events in one thread and processes them
in another.  If we add async write completion to the
Leader/Followers MPM (or incorporate a leader/follower
thread model into Event), it should reduce the context
switches considerably.

> ...
>
>> - The main pollset in the Event MPM currently is sized to
>>   hold up to one socket descriptor per worker thread.  With
>>   asynchronous keepalives and write completion, the pollset
>>   should accommodate many descriptors per thread.
>>
>
> The pollset is auto-resizable.  That number is just the maximum  
> number of events that will ever be returned by a single call to  
> _poll().  This number is perfect for the number of threads, since 
> we can never dispatch to more than the number of threads we have...

Ah, thanks.  I missed that key point.  It makes more sense now.

Brian


Re: async write completion prototype

Posted by Paul Querna <ch...@force-elite.com>.
Brian Pane wrote:
> With the batch of commits I did this weekend, the Event MPM in
> the async-dev Subversion branch now does write completion
> in a nonblocking manner.  Once an entire response has been
> generated and passed to the output filter chain, the MPM's
> poller/listener thread watches the connection for writability
> events.  When the connection becomes writable, the poller
> thread sends it to one of the worker threads, which writes
> some more output.

If the content has already been generated, why add the overhead of the 
context switch/sending to another thread?  Can't the same event thread 
do a non-blocking write?

Once it finishes writing, then yes, we do require a context-switch to 
another thread to do logging/cleanup.

I am mostly thinking about downloading a 1 gig file with the current 
pattern against a slow client.  A non-blocking write might only do ~64k 
at a time, causing 1 gig/64k context switches, which seems less than 
optimal.
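
The arithmetic behind that estimate, as a toy helper (hypothetical, purely for illustration): 2^30 bytes in 2^16-byte chunks is 16,384 round trips.

```c
/* A 1 GiB response pushed out in ~64 KiB non-blocking writes: the
 * number of write/poll round trips for one response. */
long completion_handoffs(long response_bytes, long chunk_bytes)
{
    return response_bytes / chunk_bytes;
}
```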

...
> - The main pollset in the Event MPM currently is sized to
>   hold up to one socket descriptor per worker thread.  With
>   asynchronous keepalives and write completion, the pollset
>   should accommodate many descriptors per thread.

The pollset is auto-resizable.  That number is just the maximum number 
of events that will ever be returned by a single call to _poll().  This 
number is perfect for the number of threads, since we can never dispatch 
to more than the number of threads we have...

> - The scoreboard probably needs a redesign.

Yes... it is completely unhelpful with the maximum number of connections 
being greater than the number of threads.  I suspect going forward there 
will be many other areas where this assumption is broken by the Event MPM.

-Paul


Re: async write completion prototype

Posted by Jeff Trawick <tr...@gmail.com>.
On 10/10/05, Greg Ames <gr...@apache.org> wrote:

> > - The scoreboard probably needs a redesign.
>
> yep.  Jeff T and I discussed this offline a while back.  a scoreboard
> slot per connection definitely has some appeal.

yes, the server needs a way to track all active connections in a
visible manner...  a mod_status report needs to show all active
work...

another concern is a module's own thread-tracking...  an example is
mod_whatkilledus...  in a threaded MPM, it tracks active requests by
pthread id and, after a crash, retrieves the pthread id again to see
what the active request was...  what can we offer modules in lieu of a
native thread id or scoreboard index for stuff like this?

Re: async write completion prototype

Posted by Brian Pane <br...@apache.org>.
On Oct 10, 2005, at 5:15 PM, Greg Ames wrote:

> - event-ize lingering close.  it eats up roughly the same number of  
> worker threads as synchronous writes for SPECweb99.

Is this because the lingering close is waiting a while for the client to
close the inbound side of the connection?  Or is the lingering close
finding that the connection is closed as soon as it does the "wait for
I/O or timeout"--meaning that the reason the server spends a lot of
time in lingering close is simply that the code for lingering close
(poll+read+close plus various setsockopt calls) takes a while?

Brian


Re: async write completion prototype

Posted by Greg Ames <gr...@apache.org>.
Brian Pane wrote:
> With the batch of commits I did this weekend, the Event MPM in
> the async-dev Subversion branch now does write completion
> in a nonblocking manner.  

very cool!

> There are several more things that need to be fixed in order
> to make the asynchronous write completion useful in a
> production release of httpd-2.x:

...

> - The logic for starting more child processes, which Event
>   inherited from Worker, is based on assumptions about
>   the number of concurrent connections being equal to
>   the number of threads.  These assumptions aren't valid
>   for a multiple-connections-per-thread MPM.

certainly the name MaxClients is wrong - it's really MaxWorkerThreads 
for event.  but the logic does a pretty good job of managing the threads 
if you can get past the name.

> - Similarly, there may be some changes needed in the
>   flow control logic that the listener thread uses to decide
>   whether it can do an accept.

the flow control I'm aware of is that ap_queue_info_wait_for_idler 
blocks, so the listener temporarily quits accept()ing until worker 
threads are available.  clearly that is needed.  the question is whether 
there should be some other cap on the number of connections per process.

if we do a really good job of raising the connections per thread ratio 
and continue to use ThreadsPerChild as the throttle, we will be bumping 
against OS file descriptor per process limits more often.  that sounds 
kinda ugly, so I think we will want some kind of MaxConnectionsPerChild.

if the async write completions are moved to the listener thread as Paul 
suggests, we might want another flow control change.  there's no need to 
reserve/block for a worker thread in that case.

> - The scoreboard probably needs a redesign.

yep.  Jeff T and I discussed this offline a while back.  a scoreboard 
slot per connection definitely has some appeal.

...

my list:

- make it work with mod_ssl and http pipelining. this 
http://mail-archives.apache.org/mod_mbox/httpd-dev/200411.mbox/%3C4186E563.9070202@remulak.net%3E 
fixes it in theory.  my problem is testing/verifying it.  if I knew how 
to make mod_ssl's input filters stash data it shouldn't be too bad.

- event-ize lingering close.  it eats up roughly the same number of 
worker threads as synchronous writes for SPECweb99.

Greg