Posted to dev@httpd.apache.org by Brian Pane <br...@apache.org> on 2005/10/10 08:50:39 UTC
async write completion prototype
With the batch of commits I did this weekend, the Event MPM in
the async-dev Subversion branch now does write completion
in a nonblocking manner. Once an entire response has been
generated and passed to the output filter chain, the MPM's
poller/listener thread watches the connection for writability
events. When the connection becomes writable, the poller
thread sends it to one of the worker threads, which writes
some more output.
At this point, the event-handling code is ready for testing and
review by other developers.
The main changes on the async-dev branch (compared
to the 2.3-dev trunk) are:
1. ap_core_output_filter: rewrite to do nonblocking writes
whenever possible.
2. core, http module, and mod_logio: removed the generation
of flush buckets where possible.
3. request cleanup and logging: the logger phase and
subsequent destruction of the request's pool are now
triggered by the destruction of an End-Of-Request
bucket in the core output filter.
4. event MPM: asynchronous handling of CONN_STATE_WRITE_COMPLETION.
There are several more things that need to be fixed in order
to make the asynchronous write completion useful in a
production release of httpd-2.x:
- The main pollset in the Event MPM currently is sized to
hold up to one socket descriptor per worker thread. With
asynchronous keepalives and write completion, the pollset
should accommodate many descriptors per thread.
- The logic for starting more child processes, which Event
inherited from Worker, is based on assumptions about
the number of concurrent connections being equal to
the number of threads. These assumptions aren't valid
for a multiple-connections-per-thread MPM.
- Similarly, there may be some changes needed in the
flow control logic that the listener thread uses to decide
whether it can do an accept.
- The scoreboard probably needs a redesign.
- It may be valuable to have a separate thread pool to
run handlers that do arbitrarily lengthy processing, such
as mod_perl and mod_php.
Brian
Re: async write completion prototype
Posted by Phillip Susi <ps...@cfl.rr.com>.
Nonblocking is not async IO. It is not really possible to perform zero-copy
IO with nonblocking IO semantics; you must have full async IO to
issue multiple pending requests.
Brian Akins wrote:
> Phillip Susi wrote:
>
>> As an alternative, you can bypass the cache and do direct async IO to
>> the disk with zero copies. IIRC, this is supported on linux with the
>> O_DIRECT flag. Doing this though, means that you will need to handle
>> caching yourself, which might not be such a good idea. Does Linux not
>> support O_DIRECT on sockets?
>
>
> Can you not just set the socket to non-blocking using O_NONBLOCK?
>
Re: async write completion prototype
Posted by Brian Akins <br...@turner.com>.
Phillip Susi wrote:
> As an alternative, you can bypass the cache and do direct async IO to
> the disk with zero copies. IIRC, this is supported on linux with the
> O_DIRECT flag. Doing this though, means that you will need to handle
> caching yourself, which might not be such a good idea. Does Linux not
> support O_DIRECT on sockets?
Can you not just set the socket to non-blocking using O_NONBLOCK?
--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies
Re: async write completion prototype
Posted by Phillip Susi <ps...@cfl.rr.com>.
On NT you can set the kernel buffer size on the socket to 0 (with
setsockopt() or ioctlsocket()?), and the NIC can DMA directly from the
user buffers to send rather than copy to kernel space. This, of course,
requires that you keep more than one pending async operation so the NIC
always has a buffer available and can keep the line saturated. If
you memory-map the source file from the disk, then zero-copy IO can be
done entirely from user space. Is this optimization not available on
Linux or FreeBSD?
As an alternative, you can bypass the cache and do direct async IO to
the disk with zero copies. IIRC, this is supported on linux with the
O_DIRECT flag. Doing this though, means that you will need to handle
caching yourself, which might not be such a good idea. Does Linux not
support O_DIRECT on sockets?
By using this technique I have been able to achieve TransmitFile()
performance levels entirely from user space, without any of the
drawbacks of TransmitFile(). Specifically, virtually zero CPU time is
needed to saturate multiple 100 Mbps links, pushing 11,820 KB/s;
progress is known the entire time and can be canceled at any time; and
a small handful of threads can service thousands of clients.
Paul Querna wrote:
> Phillip Susi wrote:
>
>> On what OS? Linux? NT supports async IO on sockets rather nicely, as
>> does FreeBSD iirc.
>
>
> The event MPM doesn't run on NT at all, only Unixes.
>
> Yes, FreeBSD (and linux) support async_write().. But this requires that
> you read the file off of disk, and into a buffer, and then copy it back
> into the kernel. With a non-blocking sendfile, we can avoid all the
> data copying (aka 'zero copy'), and let the kernel do everything itself.
>
> There is currently no such thing as 'async sendfile', which would be the
> perfect solution for this use case. There have been various people
> mentioning it as an idea, but no one has gone out and done it.
>
>
> -Paul
>
Re: async write completion prototype
Posted by Paul Querna <ch...@force-elite.com>.
Phillip Susi wrote:
> On what OS? Linux? NT supports async IO on sockets rather nicely, as
> does FreeBSD iirc.
The event MPM doesn't run on NT at all, only Unixes.
Yes, FreeBSD (and linux) support async_write().. But this requires that
you read the file off of disk, and into a buffer, and then copy it back
into the kernel. With a non-blocking sendfile, we can avoid all the
data copying (aka 'zero copy'), and let the kernel do everything itself.
There is currently no such thing as 'async sendfile', which would be the
perfect solution for this use case. There have been various people
mentioning it as an idea, but no one has gone out and done it.
-Paul
Re: async write completion prototype
Posted by Phillip Susi <ps...@cfl.rr.com>.
On what OS? Linux? NT supports async IO on sockets rather nicely, as
does FreeBSD iirc.
Paul Querna wrote:
> Phillip Susi wrote:
>
>> Nicely done. Have you done any benchmarking to see if this improved
>> performance as one would expect? Would it be much more work to use
>> true async IO instead of non blocking IO and polling? What about
>> doing the same for reads, as well as writes?
>>
>
> All current async_io methods require you to read the data off disk, ie
> there is no async_sendfile()....
>
> Reads are much harder in the current httpd. There are several core
> functions that would need to be rewritten first.
>
> -Paul
>
Re: async write completion prototype
Posted by Paul Querna <ch...@force-elite.com>.
Phillip Susi wrote:
> Nicely done. Have you done any benchmarking to see if this improved
> performance as one would expect? Would it be much more work to use true
> async IO instead of non blocking IO and polling? What about doing the
> same for reads, as well as writes?
>
All current async_io methods require you to read the data off disk, ie
there is no async_sendfile()....
Reads are much harder in the current httpd. There are several core
functions that would need to be rewritten first.
-Paul
Re: async write completion prototype
Posted by Phillip Susi <ps...@cfl.rr.com>.
Nicely done. Have you done any benchmarking to see if this improved
performance as one would expect? Would it be much more work to use true
async IO instead of non blocking IO and polling? What about doing the
same for reads, as well as writes?
Brian Pane wrote:
> With the batch of commits I did this weekend, the Event MPM in
> the async-dev Subversion branch now does write completion
> in a nonblocking manner. Once an entire response has been
> generated and passed to the output filter chain, the MPM's
> poller/listener thread watches the connection for writability
> events. When the connection becomes writable, the poller
> thread sends it to one of the worker threads, which writes
> some more output.
>
> At this point, the event-handling code is ready for testing and
> review by other developers.
>
> The main changes on the async-dev branch (compared
> to the 2.3-dev trunk) are:
>
> 1. ap_core_output_filter: rewrite to do nonblocking writes
> whenever possible.
>
> 2. core, http module, and mod_logio: removed the generation
> of flush buckets where possible.
>
> 3. request cleanup and logging: the logger phase and
> subsequent destruction of the request's pool are now
> triggered by the destruction of an End-Of-Request
> bucket in the core output filter.
>
> 4. event MPM: asynchronous handling of CONN_STATE_WRITE_COMPLETION.
>
> There are several more things that need to be fixed in order
> to make the asynchronous write completion useful in a
> production release of httpd-2.x:
>
> - The main pollset in the Event MPM currently is sized to
> hold up to one socket descriptor per worker thread. With
> asynchronous keepalives and write completion, the pollset
> should accommodate many descriptors per thread.
>
> - The logic for starting more child processes, which Event
> inherited from Worker, is based on assumptions about
> the number of concurrent connections being equal to
> the number of threads. These assumptions aren't valid
> for a multiple-connections-per-thread MPM.
>
> - Similarly, there may be some changes needed in the
> flow control logic that the listener thread uses to decide
> whether it can do an accept.
>
> - The scoreboard probably needs a redesign.
>
> - It may be valuable to have a separate thread pool to
> run handlers that do arbitrarily lengthy processing, such
> as mod_perl and mod_php.
>
> Brian
>
Re: async write completion prototype
Posted by Greg Ames <gr...@apache.org>.
Greg Ames wrote:
> this is interesting to me because Brian Atkins recently reported that
s/Atkins/Akins/ sorry, Brian
Greg
Re: async write completion prototype
Posted by Brian Akins <br...@turner.com>.
Greg Ames wrote:
> do you recall if CPU cycles were maxed out in both cases?
Yes.
--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies
Re: async write completion prototype
Posted by Greg Ames <gr...@apache.org>.
Brian Akins wrote:
> Basically, I was referring to the overall hits a box could serve per
> second.
>
> with 512 concurrent connections and about an 8k file, 2.1 with worker
> served about 22k request/second. event served about 14k.
do you recall if CPU cycles were maxed out in both cases?
thanks,
Greg
Re: async write completion prototype
Posted by Paul Querna <ch...@force-elite.com>.
Brian Pane wrote:
> On Oct 18, 2005, at 7:11 AM, Greg Ames wrote:
>
>> Brian Pane wrote:
>>
>>> I think one contributor to the event results is an issue that Paul
>>> Querna
>>> pointed out on #httpd-dev the other day: apr_pollset_remove runs in O(n)
>>> time with n descriptors in the pollset.
>>>
>>
>> thanks, I see it. yeah we are going to have to do something about that.
>
> I just committed a change to the epoll version that eliminates the
> O(n) loop--and the mutex operations and a bit of data structure
> copying.
Awesome! I really like it, a very nice addition to apr_pollset. I will
try to update APR with KQueue support on Sunday.
> The version of the Event MPM on the async-dev branch takes
> advantage of this new feature. I'm seeing a ~5% increase in
> throughput in a simple test setup (http_load on a client machine
> driving ~200 concurrent connections over 1Gb/s ethernet to
> Apache running on Linux 2.6).
>
> If anybody with a more industrial-strength load testing setup
> can try the async-dev version of the Event MPM with a few
> thousand concurrent connections, I'm eager to hear whether
> this new epoll code yields a useful speedup.
I agree, I have had problems in the past telling if any changes to the
Event MPM have good or bad performance implications. Some of it is best
guess, but it really would be nice to have a semi-reliable way to
benchmark it that included Keep Alive connections.
-Paul
Re: async write completion prototype
Posted by Brian Pane <br...@apache.org>.
On Oct 18, 2005, at 7:11 AM, Greg Ames wrote:
> Brian Pane wrote:
>
>> I think one contributor to the event results is an issue that
>> Paul Querna
>> pointed out on #httpd-dev the other day: apr_pollset_remove runs
>> in O(n)
>> time with n descriptors in the pollset.
>>
>
> thanks, I see it. yeah we are going to have to do something about
> that.
I just committed a change to the epoll version that eliminates the
O(n) loop--and the mutex operations and a bit of data structure
copying.
The version of the Event MPM on the async-dev branch takes
advantage of this new feature. I'm seeing a ~5% increase in
throughput in a simple test setup (http_load on a client machine
driving ~200 concurrent connections over 1Gb/s ethernet to
Apache running on Linux 2.6).
If anybody with a more industrial-strength load testing setup
can try the async-dev version of the Event MPM with a few
thousand concurrent connections, I'm eager to hear whether
this new epoll code yields a useful speedup.
Thanks,
Brian
Re: async write completion prototype
Posted by Greg Ames <gr...@apache.org>.
Brian Pane wrote:
> I think one contributor to the event results is an issue that Paul Querna
> pointed out on #httpd-dev the other day: apr_pollset_remove runs in O(n)
> time with n descriptors in the pollset.
thanks, I see it. yeah we are going to have to do something about that.
Greg
Re: async write completion prototype
Posted by Brian Pane <br...@apache.org>.
I think one contributor to the event results is an issue that Paul
Querna
pointed out on #httpd-dev the other day: apr_pollset_remove runs in O(n)
time with n descriptors in the pollset.
Brian
On Oct 13, 2005, at 11:36 AM, Brian Akins wrote:
> Greg Ames wrote:
>
>
>> this is interesting to me because Brian Atkins recently reported
>> that the event MPM was much slower. http://mail-
>> archives.apache.org/mod_mbox/httpd-dev/200509.mbox/%
>> 3c43219161.3030102@web.turner.com%3e
>>
>
> No "t" in my last name :)
>
>
> Basically, I was referring to the overall hits a box could serve
> per second.
>
> with 512 concurrent connections and about an 8k file, 2.1 with
> worker served about 22k request/second. event served about 14k.
>
> It's been a while since I did the test, and I'm too busy for the
> next few days to re-run them.
>
>
> --
> Brian Akins
> Lead Systems Engineer
> CNN Internet Technologies
>
Re: async write completion prototype
Posted by Brian Akins <br...@turner.com>.
Greg Ames wrote:
> this is interesting to me because Brian Atkins recently reported that
> the event MPM was much slower.
> http://mail-archives.apache.org/mod_mbox/httpd-dev/200509.mbox/%3c43219161.3030102@web.turner.com%3e
No "t" in my last name :)
Basically, I was referring to the overall hits a box could serve per second.
with 512 concurrent connections and about an 8k file, 2.1 with worker
served about 22k request/second. event served about 14k.
It's been a while since I did the test, and I'm too busy for the next
few days to re-run them.
--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies
Re: async write completion prototype
Posted by Greg Ames <gr...@apache.org>.
Brian Pane wrote:
> On Oct 10, 2005, at 12:01 AM, Paul Querna wrote:
>> If the content has already been generated, why add the overhead of
>> the context switch/sending to another thread? Can't the same event
>> thread do a non-blocking write?
>>
>> Once it finishes writing, then yes, we do require a context-switch to
>> another thread to do logging/cleanup.
>>
>> I am mostly thinking about downloading a 1 gig file with the current
>> pattern against a slow client. A non-blocking write might only do
> ~64k at a time, and cause 1 gig/64k context switches, which seems
>> less than optimal.
>
>
> If I had to choose, I'd rather do the context switches than devote a
> thread (and the associated stack space) to the connection until
> the writes are finished--especially if the server is delivering a
> thousand 1GB files to slow clients concurrently.
>
> However, it's probably possible to have _both_ a high ratio
> of connections to threads (for scalability) and a low ratio of
> context switches to megabytes delivered (for efficiency).
> The Event MPM currently has to do a lot of context switching
> because it detects events in one thread and processes them
> in another. If we add async write completion to the
> Leader/Followers MPM (or incorporate a leader/follower
> thread model into Event), it should reduce the context
> switches considerably.
this is interesting to me because Brian Atkins recently reported that
the event MPM was much slower.
http://mail-archives.apache.org/mod_mbox/httpd-dev/200509.mbox/%3c43219161.3030102@web.turner.com%3e
it would be nice to hear more details, but I assume that this means
event is burning more CPU for a given workload rather than some kind of
extra latency bug. we know that event has more context switching than
worker when keepalives are in use but pipelining is not, and async write
completion will add to it. I suppose we should profile event and worker
and compare profiles in case there's some other unexpected CPU burner
out there.
if context switch overhead is really the culprit, how do we reduce it?
if I recall correctly, leader/follower sort of plays tag and the next
thread that's It gets to be the listener. I can see that running the
request processing on the same thread that does the accept would be more
cache friendly, and it might save some of the current queuing logic.
but doesn't this have about the same amount of pthread library/scheduler
overhead to "tag" the new listener and dispatch it as we have now waking
up worker threads?
another brainstorm is to use a short keepalive timeout, like 200ms*, on
the worker thread. if it pops, turn the connection over to the event
pollset using the remaining KeepAliveTimeout and give up the worker
thread.
Greg
*200ms - the idea is to use something just big enough to cover most
network round trip times, so we catch the case where the browser sends
the next request immediately after getting our response.
Re: async write completion prototype
Posted by Brian Pane <br...@apache.org>.
On Oct 10, 2005, at 12:01 AM, Paul Querna wrote:
> Brian Pane wrote:
>
>> With the batch of commits I did this weekend, the Event MPM in
>> the async-dev Subversion branch now does write completion
>> in a nonblocking manner. Once an entire response has been
>> generated and passed to the output filter chain, the MPM's
>> poller/listener thread watches the connection for writability
>> events. When the connection becomes writable, the poller
>> thread sends it to one of the worker threads, which writes
>> some more output.
>>
>
> If the content has already been generated, why add the overhead of
> the context switch/sending to another thread? Can't the same event
> thread do a non-blocking write?
>
> Once it finishes writing, then yes, we do require a context-switch
> to another thread to do logging/cleanup.
>
> I am mostly thinking about downloading a 1 gig file with the
> current pattern against a slow client. A non-blocking write might
> only do ~64k at a time, and cause 1 gig/64k context switches,
> which seems less than optimal.
If I had to choose, I'd rather do the context switches than devote a
thread (and the associated stack space) to the connection until
the writes are finished--especially if the server is delivering a
thousand 1GB files to slow clients concurrently.
However, it's probably possible to have _both_ a high ratio
of connections to threads (for scalability) and a low ratio of
context switches to megabytes delivered (for efficiency).
The Event MPM currently has to do a lot of context switching
because it detects events in one thread and processes them
in another. If we add async write completion to the
Leader/Followers MPM (or incorporate a leader/follower
thread model into Event), it should reduce the context
switches considerably.
> ...
>
>> - The main pollset in the Event MPM currently is sized to
>> hold up to one socket descriptor per worker thread. With
>> asynchronous keepalives and write completion, the pollset
>> should accommodate many descriptors per thread.
>>
>
> The pollset is auto-resizable. That number is just the maximum
> number of events that will ever be returned by a single call to
> _poll(). This number is perfect for the number of threads, since
> we can never dispatch to more than the number of threads we have...
Ah, thanks. I missed that key point. It makes more sense now.
Brian
Re: async write completion prototype
Posted by Paul Querna <ch...@force-elite.com>.
Brian Pane wrote:
> With the batch of commits I did this weekend, the Event MPM in
> the async-dev Subversion branch now does write completion
> in a nonblocking manner. Once an entire response has been
> generated and passed to the output filter chain, the MPM's
> poller/listener thread watches the connection for writability
> events. When the connection becomes writable, the poller
> thread sends it to one of the worker threads, which writes
> some more output.
If the content has already been generated, why add the overhead of the
context switch/sending to another thread? Can't the same event thread
do a non-blocking write?
Once it finishes writing, then yes, we do require a context-switch to
another thread to do logging/cleanup.
I am mostly thinking about downloading a 1 gig file with the current
pattern against a slow client. A non-blocking write might only do ~64k
at a time, and cause 1 gig/64k context switches, which seems less than
optimal.
...
> - The main pollset in the Event MPM currently is sized to
> hold up to one socket descriptor per worker thread. With
> asynchronous keepalives and write completion, the pollset
> should accommodate many descriptors per thread.
The pollset is auto-resizable. That number is just the maximum number
of events that will ever be returned by a single call to _poll(). This
number is perfect for the number of threads, since we can never dispatch
to more than the number of threads we have...
> - The scoreboard probably needs a redesign.
Yes.. it is completely unhelpful with the maximum number of connections
being > number of threads. I suspect going forward there will be many
other areas where this assumption is broken by the event mpm.
-Paul
Re: async write completion prototype
Posted by Jeff Trawick <tr...@gmail.com>.
On 10/10/05, Greg Ames <gr...@apache.org> wrote:
> > - The scoreboard probably needs a redesign.
>
> yep. Jeff T and I discussed this offline a while back. a scoreboard
slot per connection definitely has some appeal.
yes, the server needs a way to track all active connections in a
visible manner... a mod_status report needs to show all active
work...
another concern is a module's own thread-tracking... an example is
mod_whatkilledus... in a threaded MPM, it tracks active requests by
pthread id and, after a crash, retrieves the pthread id again to see
what the active request was... what can we offer modules in lieu of a
native thread id or scoreboard index for stuff like this?
Re: async write completion prototype
Posted by Brian Pane <br...@apache.org>.
On Oct 10, 2005, at 5:15 PM, Greg Ames wrote:
> - event-ize lingering close. it eats up roughly the same number of
> worker threads as synchronous writes for SPECweb99.
Is this because the lingering close is waiting a while for the client to
close the inbound side of the connection? Or is the lingering close
finding that the connection is closed as soon as it does the "wait for
I/O or timeout"--meaning that the reason the server spends a lot of
time in lingering close is simply that the code for lingering close
(poll+read+close plus various setsockopt calls) takes a while?
Brian
Re: async write completion prototype
Posted by Greg Ames <gr...@apache.org>.
Brian Pane wrote:
> With the batch of commits I did this weekend, the Event MPM in
> the async-dev Subversion branch now does write completion
> in a nonblocking manner.
very cool!
> There are several more things that need to be fixed in order
> to make the asynchronous write completion useful in a
> production release of httpd-2.x:
...
> - The logic for starting more child processes, which Event
> inherited from Worker, is based on assumptions about
> the number of concurrent connections being equal to
> the number of threads. These assumptions aren't valid
> for a multiple-connections-per-thread MPM.
certainly the name MaxClients is wrong - it's really MaxWorkerThreads
for event. but the logic does a pretty good job of managing the threads
if you can get past the name.
> - Similarly, there may be some changes needed in the
> flow control logic that the listener thread uses to decide
> whether it can do an accept.
the flow control I'm aware of is that ap_queue_info_wait_for_idler
blocks, therefore the listener temporarily quits accept()ing until
worker threads are available. clearly that is needed. the question is
should there be some other cap on the number of connections per process.
if we do a really good job of raising the connections per thread ratio
and continue to use ThreadsPerChild as the throttle, we will be bumping
against OS file descriptor per process limits more often. that sounds
kinda ugly, so I think we will want some kind of MaxConnectionsPerChild.
if the async write completions are moved to the listener thread as Paul
suggests, we might want another flow control change. there's no need to
reserve/block for a worker thread in that case.
> - The scoreboard probably needs a redesign.
yep. Jeff T and I discussed this offline a while back. a scoreboard
slot per connection definitely has some appeal.
...
my list:
- make it work with mod_ssl and http pipelining. this
http://mail-archives.apache.org/mod_mbox/httpd-dev/200411.mbox/%3C4186E563.9070202@remulak.net%3E
fixes it in theory. my problem is testing/verifying it. if I knew how
to make mod_ssl's input filters stash data it shouldn't be too bad.
- event-ize lingering close. it eats up roughly the same number of
worker threads as synchronous writes for SPECweb99.
Greg