Posted to dev@httpd.apache.org by Brian Pane <br...@apache.org> on 2002/09/01 01:54:20 UTC

Bucket management strategies for async MPMs?

I've been thinking about strategies for building a
multiple-connection-per-thread MPM for 2.0.  It's
conceptually easy to do this:

  * Start with worker.

  * Keep the model of one worker thread per request,
    so that blocking or CPU-intensive modules don't
    need to be rewritten as state machines.

  * In the core output filter, instead of doing
    actual socket writes, hand off the output
    brigades to a "writer thread."

  * As soon as the worker thread has sent an EOS
    to the writer thread, let the worker thread
    move on to the next request.

  * In the writer thread, use a big event loop
    (with /dev/poll or RT signals or kqueue, depending
    on platform) to do nonblocking writes for all
    open connections.
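
As a rough illustration, the writer thread's loop might look
something like the sketch below.  It uses plain poll(2) for clarity;
the /dev/poll, RT-signal, and kqueue variants would share the same
structure.  conn_entry, drain_some, and writer_loop are illustrative
names, not existing httpd or APR symbols.

  /* Sketch only: a poll(2)-based writer loop that drains each
   * connection's pending brigade as its socket becomes writable.
   */
  #include <poll.h>
  #include <unistd.h>
  #include <stdlib.h>
  #include "apr_buckets.h"

  typedef struct conn_entry {
      int fd;                          /* client socket, nonblocking */
      apr_bucket_brigade *pending;     /* output handed off by a worker */
      int saw_eos;                     /* close once the brigade drains */
  } conn_entry;

  /* Write whatever the socket will accept without blocking. */
  static void drain_some(conn_entry *c)
  {
      while (!APR_BRIGADE_EMPTY(c->pending)) {
          apr_bucket *b = APR_BRIGADE_FIRST(c->pending);
          const char *data;
          apr_size_t len;
          ssize_t n;

          if (apr_bucket_read(b, &data, &len, APR_NONBLOCK_READ)
              != APR_SUCCESS) {
              break;
          }
          n = write(c->fd, data, len);
          if (n < 0) {
              break;                   /* EAGAIN: wait for next POLLOUT */
          }
          if ((apr_size_t)n < len) {
              apr_bucket_split(b, n);  /* keep the unwritten tail */
              apr_bucket_delete(b);
              break;
          }
          apr_bucket_delete(b);
      }
  }

  static void writer_loop(conn_entry *conns, int nconns)
  {
      struct pollfd *pfds = calloc(nconns, sizeof(*pfds));
      int i;

      for (;;) {
          for (i = 0; i < nconns; i++) {
              pfds[i].fd = conns[i].fd;
              pfds[i].events = POLLOUT;
          }
          if (poll(pfds, nconns, -1) <= 0)
              continue;
          for (i = 0; i < nconns; i++) {
              if (!(pfds[i].revents & POLLOUT))
                  continue;
              drain_some(&conns[i]);
              if (conns[i].saw_eos && APR_BRIGADE_EMPTY(conns[i].pending))
                  close(conns[i].fd);  /* response fully sent */
          }
      }
  }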

This would allow us to use a much smaller number of
worker threads for the same amount of traffic
(at least for typical workloads in which the network
write time constitutes the majority of each request's
duration).

The problem, though, is that passing brigades between
threads is unsafe:

  * The bucket allocator alloc/free code isn't
    thread-safe, so bad things will happen if the
    writer thread tries to free a bucket (that's
    just been written to the client) at the same
    time that a worker thread is allocating a new
    bucket for a subsequent request on the same
    connection.

  * If we delete the request pool when the worker
    thread finishes its work on the request, the
    pool cleanup will close the underlying objects
    for the request's file/pipe/mmap/etc buckets.
    When the writer thread tries to output these
    buckets, the writes will fail.

There are other ways to structure an async MPM, but
in almost all cases we'll face the same problem:
buckets that get created by one thread must be
delivered and then freed by a different thread, and
the current memory management design can't handle
that.

The cleanest solution I've thought of so far is:

  * Modify the bucket allocator code to allow
    thread-safe alloc/free of buckets.  For the
    common cases, it should be possible to do
    this without mutexes by using apr_atomic_cas()
    based spin loops.  (There will be at most two
    threads contending for the same allocator--
    one worker thread and the writer thread--so
    the amount of spinning should be minimal.)

  * Don't delete the request pool at the end of
    a request.  Instead, delay its deletion until
    the last bucket from that request is sent.
    One way to do this is to create a new metadata
    bucket type that stores the pointer to the
    request pool.  The worker thread can append
    this metadata bucket to the output brigade,
    right before the EOS.  The writer thread then
    reads the metadata bucket and deletes (or
    clears and recycles) the referenced pool after
    sending the response.  This would mean, however,
    that the request pool couldn't be a subpool of
    the connection pool.  The writer thread would have
    to be careful to clean up the request pool(s)
    upon connection abort.
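
To make the first idea concrete, here is a sketch of the kind of
CAS-based free list I have in mind.  Today's apr_atomic_cas() operates
on 32-bit values, so the pointer-sized compare-and-swap below
(apr_atomic_casptr) is an assumption about the API, and a production
version would also need to address the ABA problem:

  /* Sketch: lock-free free-list push/pop via compare-and-swap.
   * apr_atomic_casptr() is assumed to return the old value of *mem.
   */
  #include <stdlib.h>
  #include "apr_atomic.h"

  typedef struct free_node {
      struct free_node *next;
  } free_node;

  static void freelist_push(free_node *volatile *head, free_node *node)
  {
      free_node *old;
      do {
          old = *head;
          node->next = old;
      } while (apr_atomic_casptr((void *volatile *)head, node, old) != old);
  }

  static free_node *freelist_pop(free_node *volatile *head)
  {
      free_node *old;
      do {
          old = *head;
          if (old == NULL)
              return NULL;             /* empty: fall back to malloc */
          /* NOTE: susceptible to ABA; a real version needs a version
           * counter or similar. */
      } while (apr_atomic_casptr((void *volatile *)head, old->next, old)
               != old);
      return old;
  }

And for the second idea, a sketch of the pool-lifetime metadata bucket;
the type, the names, and the assumption that the caller has an
apr_bucket_alloc_t for the connection are all hypothetical:

  /* Sketch: a metadata bucket that owns the request pool.  The worker
   * appends one just before EOS; when the writer destroys it after the
   * response is sent, the destroy callback tears down the pool.
   */
  #include "apr_buckets.h"
  #include "apr_pools.h"

  static void lifetime_destroy(void *data)
  {
      apr_pool_destroy(data);          /* or clear and recycle the pool */
  }

  static apr_status_t lifetime_read(apr_bucket *b, const char **str,
                                    apr_size_t *len, apr_read_type_e block)
  {
      *str = NULL;                     /* metadata: nothing to read */
      *len = 0;
      return APR_SUCCESS;
  }

  static const apr_bucket_type_t ap_bucket_type_pool_lifetime = {
      "POOL_LIFETIME", 5, APR_BUCKET_METADATA,
      lifetime_destroy,
      lifetime_read,
      apr_bucket_setaside_noop,
      apr_bucket_split_notimpl,
      apr_bucket_copy_notimpl          /* the pool must die exactly once */
  };

  static apr_bucket *ap_bucket_pool_lifetime_create(apr_pool_t *rpool,
                                                    apr_bucket_alloc_t *list)
  {
      apr_bucket *b = apr_bucket_alloc(sizeof(*b), list);

      APR_BUCKET_INIT(b);
      b->free   = apr_bucket_free;
      b->list   = list;
      b->type   = &ap_bucket_type_pool_lifetime;
      b->length = 0;
      b->start  = 0;
      b->data   = rpool;               /* pool to destroy after the send */
      return b;
  }

The worker would insert one of these right before the EOS bucket, and
the writer would simply delete buckets as it finishes with them.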

I'm eager to hear comments from others who have looked
at the async design issues.

Thanks,
Brian



Re: Bucket management strategies for async MPMs?

Posted by Brian Pane <br...@apache.org>.
Paul J. Reder wrote:

>
>
> Brian Pane wrote:
>
>> I've been thinking about strategies for building a
>> multiple-connection-per-thread MPM for 2.0.  It's
>> conceptually easy to do this:
>>
>>  * Start with worker.
>>
>>  * Keep the model of one worker thread per request,
>>    so that blocking or CPU-intensive modules don't
>>    need to be rewritten as state machines.
>>
>>  * In the core output filter, instead of doing
>>    actual socket writes, hand off the output
>>    brigades to a "writer thread."
>
>
>
> During a discussion today, the idea came up to have the
> code check if it could be written directly instead of
> always passing it to the writer. If the whole response
> is present and can be successfully written, why not save
> the overhead. If the write fails, or the response is too
> complex, then pass it over to the writer.


+1.  In cases where the entire file can be
delivered in one call to sendfile/sendfilev,
all we'll have to do in the writer thread is
close the connection once the write completes.

>
>
>>
>>  * As soon as the worker thread has sent an EOS
>>    to the writer thread, let the worker thread
>>    move on to the next request.
>
>
>
> I have a small concern here. Right now the writes are
> providing the throttle that keeps the system from generating
> so much queued output that we burn system resources. If
> we allow workers to generate responses without a throttle,
> it seems possible that the writer's queue will grow to the
> point that the system starts running out of resources.


Right.  The solution I'd been thinking of is
a variant of the current worker's "queue_info"
struct: a central per-process structure that
keeps a count of open connections.  The listener
thread increments this counter every time it does
an accept, and the writer thread decrements it
every time a connection completes.  If the count
reaches a configured maximum, the listener blocks
on a condition variable until the writer closes
at least one current connection and wakes up
the listener.
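
A minimal sketch of that counter, using APR's mutex and condition
variable primitives (the struct and function names are illustrative):

  /* Sketch: per-process connection counter with listener back-pressure.
   * lock and not_full would be created with apr_thread_mutex_create()
   * and apr_thread_cond_create(); setup is omitted here.
   */
  #include "apr_thread_mutex.h"
  #include "apr_thread_cond.h"

  typedef struct conn_count {
      apr_thread_mutex_t *lock;
      apr_thread_cond_t  *not_full;
      int current;                     /* open connections right now */
      int max;                         /* configured maximum */
  } conn_count;

  /* Listener: call before each accept(); blocks at the limit. */
  static void conn_count_acquire(conn_count *cc)
  {
      apr_thread_mutex_lock(cc->lock);
      while (cc->current >= cc->max)
          apr_thread_cond_wait(cc->not_full, cc->lock);
      cc->current++;
      apr_thread_mutex_unlock(cc->lock);
  }

  /* Writer: call when a connection completes; wakes the listener. */
  static void conn_count_release(conn_count *cc)
  {
      apr_thread_mutex_lock(cc->lock);
      cc->current--;
      apr_thread_cond_signal(cc->not_full);
      apr_thread_mutex_unlock(cc->lock);
  }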

Brian



Re: Bucket management strategies for async MPMs?

Posted by Ian Holsman <ia...@apache.org>.
Paul J. Reder wrote:
> 
> 
> Brian Pane wrote:
> 
>> I've been thinking about strategies for building a
>> multiple-connection-per-thread MPM for 2.0.  It's
>> conceptually easy to do this:
>>
>>  * Start with worker.
>>
>>  * Keep the model of one worker thread per request,
>>    so that blocking or CPU-intensive modules don't
>>    need to be rewritten as state machines.
>>
>>  * In the core output filter, instead of doing
>>    actual socket writes, hand off the output
>>    brigades to a "writer thread."
> 
> 
> 
> During a discussion today, the idea came up to have the
> code check if it could be written directly instead of
> always passing it to the writer. If the whole response
> is present and can be successfully written, why not save
> the overhead. If the write fails, or the response is too
> complex, then pass it over to the writer.
> 
> 
>>
>>  * As soon as the worker thread has sent an EOS
>>    to the writer thread, let the worker thread
>>    move on to the next request.
> 
> 
> 
> I have a small concern here. Right now the writes are
> providing the throttle that keeps the system from generating
> so much queued output that we burn system resources. If
> we allow workers to generate responses without a throttle,
> it seems possible that the writer's queue will grow to the
> point that the system starts running out of resources.
> 
maybe if we used something like the queue (apr-util/misc/apr_queue.c)
to submit the write requests, we could limit the number of outstanding
writes to X, with the threads sleeping when the queue gets full.
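
apr_queue_push() already blocks when the queue is at capacity, so the
handoff might look something like this sketch (write_job_t and the
helper names are made up for illustration):

  /* Sketch: bounded worker-to-writer handoff via apr-util's queue. */
  #include "apr_queue.h"
  #include "apr_buckets.h"

  #define WRITE_QUEUE_CAPACITY 256     /* "X" outstanding writes */

  typedef struct write_job_t {
      apr_bucket_brigade *bb;          /* response data for the writer */
      void *conn;                      /* connection it belongs to */
  } write_job_t;

  static apr_queue_t *write_queue;

  static apr_status_t write_queue_init(apr_pool_t *p)
  {
      return apr_queue_create(&write_queue, WRITE_QUEUE_CAPACITY, p);
  }

  /* Worker side: sleeps while the queue is full. */
  static apr_status_t submit_write(write_job_t *job)
  {
      return apr_queue_push(write_queue, job);
  }

  /* Writer side: sleeps while the queue is empty. */
  static apr_status_t next_write(write_job_t **job)
  {
      return apr_queue_pop(write_queue, (void **)job);
  }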

I'm actually working on a dynamically growing thread pool which would
read the queue and adjust the number of threads based on the size of
the queue (eventually I want to adjust the number of threads based on
the response time).

If anyone is interested in the (currently buggy) code, I'll put it up on
webperf somewhere.
> Only testing will show for sure, and maybe in the real world
> it would only happen for brief periods of heavy load, but it
> seems like we need some sort of writer queue thresholding
> with pushback to control worker throughput.
> 
> Of course, if we do add a throttle for the workers, then how
> does this really improve things? The writer was the throttle
> before and it would be again. We've added an extra queue so
> there will be a period of increased worker output until the
> queue threshold is met but, once the queue is filled, we revert
> to the writer being the throttle. The workers cannot finish
> their current response until the writer has finished writing
> a queued response and freed up a queue slot.
> 
>>
>>  * In the writer thread, use a big event loop
>>    (with /dev/poll or RT signals or kqueue, depending
>>    on platform) to do nonblocking writes for all
>>    open connections.
>>
>> This would allow us to use a much smaller number of
>> worker threads for the same amount of traffic
>> (at least for typical workloads in which the network
>> write time constitutes the majority of each request's
>> duration).
>>
>> The problem, though, is that passing brigades between
>> threads is unsafe:
>>
>>  * The bucket allocator alloc/free code isn't
>>    thread-safe, so bad things will happen if the
>>    writer thread tries to free a bucket (that's
>>    just been written to the client) at the same
>>    time that a worker thread is allocating a new
>>    bucket for a subsequent request on the same
>>    connection.
>>
>>  * If we delete the request pool when the worker
>>    thread finishes its work on the request, the
>>    pool cleanup will close the underlying objects
>>    for the request's file/pipe/mmap/etc buckets.
>>    When the writer thread tries to output these
>>    buckets, the writes will fail.
>>
>> There are other ways to structure an async MPM, but
>> in almost all cases we'll face the same problem:
>> buckets that get created by one thread must be
>> delivered and then freed by a different thread, and
>> the current memory management design can't handle
>> that.
>>
>> The cleanest solution I've thought of so far is:
>>
>>  * Modify the bucket allocator code to allow
>>    thread-safe alloc/free of buckets.  For the
>>    common cases, it should be possible to do
>>    this without mutexes by using apr_atomic_cas()
>>    based spin loops.  (There will be at most two
>>    threads contending for the same allocator--
>>    one worker thread and the writer thread--so
>>    the amount of spinning should be minimal.)
>>
>>  * Don't delete the request pool at the end of
>>    a request.  Instead, delay its deletion until
>>    the last bucket from that request is sent.
>>    One way to do this is to create a new metadata
>>    bucket type that stores the pointer to the
>>    request pool.  The worker thread can append
>>    this metadata bucket to the output brigade,
>>    right before the EOS.  The writer thread then
>>    reads the metadata bucket and deletes (or
>>    clears and recycles) the referenced pool after
>>    sending the response.  This would mean, however,
>>    that the request pool couldn't be a subpool of
>>    the connection pool.  The writer thread would have
>>    to be careful to clean up the request pool(s)
>>    upon connection abort.
>>
>> I'm eager to hear comments from others who have looked
>> at the async design issues.
>>
>> Thanks,
>> Brian
>>
>>
>>
> 
> 



Re: Bucket management strategies for async MPMs?

Posted by "Paul J. Reder" <re...@remulak.net>.

Brian Pane wrote:

> I've been thinking about strategies for building a
> multiple-connection-per-thread MPM for 2.0.  It's
> conceptually easy to do this:
> 
>  * Start with worker.
> 
>  * Keep the model of one worker thread per request,
>    so that blocking or CPU-intensive modules don't
>    need to be rewritten as state machines.
> 
>  * In the core output filter, instead of doing
>    actual socket writes, hand off the output
>    brigades to a "writer thread."


During a discussion today, the idea came up to have the
code check if it could be written directly instead of
always passing it to the writer. If the whole response
is present and can be successfully written, why not save
the overhead. If the write fails, or the response is too
complex, then pass it over to the writer.


> 
>  * As soon as the worker thread has sent an EOS
>    to the writer thread, let the worker thread
>    move on to the next request.


I have a small concern here. Right now the writes are
providing the throttle that keeps the system from generating
so much queued output that we burn system resources. If
we allow workers to generate responses without a throttle,
it seems possible that the writer's queue will grow to the
point that the system starts running out of resources.

Only testing will show for sure, and maybe in the real world
it would only happen for brief periods of heavy load, but it
seems like we need some sort of writer queue thresholding
with pushback to control worker throughput.

Of course, if we do add a throttle for the workers, then how
does this really improve things? The writer was the throttle
before and it would be again. We've added an extra queue so
there will be a period of increased worker output until the
queue threshold is met but, once the queue is filled, we revert
to the writer being the throttle. The workers cannot finish
their current response until the writer has finished writing
a queued response and freed up a queue slot.

> 
>  * In the writer thread, use a big event loop
>    (with /dev/poll or RT signals or kqueue, depending
>    on platform) to do nonblocking writes for all
>    open connections.
> 
> This would allow us to use a much smaller number of
> worker threads for the same amount of traffic
> (at least for typical workloads in which the network
> write time constitutes the majority of each request's
> duration).
> 
> The problem, though, is that passing brigades between
> threads is unsafe:
> 
>  * The bucket allocator alloc/free code isn't
>    thread-safe, so bad things will happen if the
>    writer thread tries to free a bucket (that's
>    just been written to the client) at the same
>    time that a worker thread is allocating a new
>    bucket for a subsequent request on the same
>    connection.
> 
>  * If we delete the request pool when the worker
>    thread finishes its work on the request, the
>    pool cleanup will close the underlying objects
>    for the request's file/pipe/mmap/etc buckets.
>    When the writer thread tries to output these
>    buckets, the writes will fail.
> 
> There are other ways to structure an async MPM, but
> in almost all cases we'll face the same problem:
> buckets that get created by one thread must be
> delivered and then freed by a different thread, and
> the current memory management design can't handle
> that.
> 
> The cleanest solution I've thought of so far is:
> 
>  * Modify the bucket allocator code to allow
>    thread-safe alloc/free of buckets.  For the
>    common cases, it should be possible to do
>    this without mutexes by using apr_atomic_cas()
>    based spin loops.  (There will be at most two
>    threads contending for the same allocator--
>    one worker thread and the writer thread--so
>    the amount of spinning should be minimal.)
> 
>  * Don't delete the request pool at the end of
>    a request.  Instead, delay its deletion until
>    the last bucket from that request is sent.
>    One way to do this is to create a new metadata
>    bucket type that stores the pointer to the
>    request pool.  The worker thread can append
>    this metadata bucket to the output brigade,
>    right before the EOS.  The writer thread then
>    reads the metadata bucket and deletes (or
>    clears and recycles) the referenced pool after
>    sending the response.  This would mean, however,
>    that the request pool couldn't be a subpool of
>    the connection pool.  The writer thread would have
>    to be careful to clean up the request pool(s)
>    upon connection abort.
> 
> I'm eager to hear comments from others who have looked
> at the async design issues.
> 
> Thanks,
> Brian
> 
> 
> 


-- 
Paul J. Reder
-----------------------------------------------------------
"The strength of the Constitution lies entirely in the determination of each
citizen to defend it.  Only if every single citizen feels duty bound to do
his share in this defense are the constitutional rights secure."
-- Albert Einstein



RE: Bucket management strategies for async MPMs?

Posted by Bill Stoddard <bi...@wstoddard.com>.
> I've been thinking about strategies for building a
> multiple-connection-per-thread MPM for 2.0.  It's
> conceptually easy to do this:
>
>   * Start with worker.
>
>   * Keep the model of one worker thread per request,
>     so that blocking or CPU-intensive modules don't
>     need to be rewritten as state machines.
>
>   * In the core output filter, instead of doing
>     actual socket writes, hand off the output
>     brigades to a "writer thread."

How about implementing the event loop in a filter above the
core_output_filter (in net_time_filter perhaps)?  We need to formalize the
rules for writing a filter to handle receiving APR_EAGAIN (or
APR_EWOULDBLOCK) and maybe even APR_IO_PENDING (for true async I/O), so
let's start with core_output_filter. (I am thinking of the time when we
decide to move the event_filter higher in the filter stack, like before the
chunk filter for instance.)

>
>   * As soon as the worker thread has sent an EOS
>     to the writer thread, let the worker thread
>     move on to the next request.
>
>   * In the writer thread, use a big event loop
>     (with /dev/poll or RT signals or kqueue, depending
>     on platform) to do nonblocking writes for all
>     open connections.
>
> This would allow us to use a much smaller number of
> worker threads for the same amount of traffic
> (at least for typical workloads in which the network
> write time constitutes the majority of each request's
> duration).

This is the right (write:-) optimization.  Looking at server-status on
apache.org shows most processes are busy writing responses.

>
> The problem, though, is that passing brigades between
> threads is unsafe:
>
>   * The bucket allocator alloc/free code isn't
>     thread-safe, so bad things will happen if the
>     writer thread tries to free a bucket (that's
>     just been written to the client) at the same
>     time that a worker thread is allocating a new
>     bucket for a subsequent request on the same
>     connection.
>
>   * If we delete the request pool when the worker
>     thread finishes its work on the request, the
>     pool cleanup will close the underlying objects
>     for the request's file/pipe/mmap/etc buckets.
>     When the writer thread tries to output these
>     buckets, the writes will fail.
>
> There are other ways to structure an async MPM, but
> in almost all cases we'll face the same problem:
> buckets that get created by one thread must be
> delivered and then freed by a different thread, and
> the current memory management design can't handle
> that.
>
> The cleanest solution I've thought of so far is:
>
>   * Modify the bucket allocator code to allow
>     thread-safe alloc/free of buckets.  For the
>     common cases, it should be possible to do
>     this without mutexes by using apr_atomic_cas()
>     based spin loops.  (There will be at most two
>     threads contending for the same allocator--
>     one worker thread and the writer thread--so
>     the amount of spinning should be minimal.)
>
>   * Don't delete the request pool at the end of
>     a request.  Instead, delay its deletion until
>     the last bucket from that request is sent.
>     One way to do this is to create a new metadata
>     bucket type that stores the pointer to the
>     request pool.  The worker thread can append
>     this metadata bucket to the output brigade,
>     right before the EOS.  The writer thread then
>     reads the metadata bucket and deletes (or
>     clears and recycles) the referenced pool after
>     sending the response.  This would mean, however,
>     that the request pool couldn't be a subpool of
>     the connection pool.  The writer thread would have
>     to be careful to clean up the request pool(s)
>     upon connection abort.
>
> I'm eager to hear comments from others who have looked
> at the async design issues.
>

When a brigade containing an EOS is passed to the event loop by a worker
thread, ownership of the brigade and all its buckets is passed to the event
loop, and the worker should not even touch memory in any of the buckets. The
worker cannot help but touch the brigade though (as part of unwinding the
call chain).  The key here, I think, is to reference count the brigade (or a
struct that references the brigade). The brigade is freed by the thread that
drops the ref count to 0 (we reference count cache_objects in
mod_mem_cache.c to handle race conditions between threads serving the cache
object and other threads garbage collecting the same cache object). Ditto
for clearing the ptrans pool, which implies that we should create a new ref
counted object (ap_work_t?) that contains references to ptrans, the brigade,
and anything else we see fit to stick there (like a buffer to do true async
reads). A simplification is to not attempt to buffer up responses to
pipelined requests.
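
A sketch of that ref-counted work element; ap_work_t is the name
suggested above, while the field layout and the apr_atomic_inc32/dec32
calls (the current atomic functions are the 32-bit apr_atomic_inc/dec)
are assumptions:

  /* Sketch: ref-counted work element shared by worker and writer.
   * Whichever thread drops the count to zero frees everything, as with
   * mod_mem_cache's cache objects.
   */
  #include "apr_atomic.h"
  #include "apr_pools.h"
  #include "apr_buckets.h"

  typedef struct ap_work_t {
      volatile apr_uint32_t refcount;
      apr_pool_t *ptrans;              /* transaction pool */
      apr_bucket_brigade *bb;          /* response brigade */
      char *read_buf;                  /* buffer for true async reads */
      apr_size_t read_buf_len;
  } ap_work_t;

  static void ap_work_addref(ap_work_t *w)
  {
      apr_atomic_inc32(&w->refcount);
  }

  static void ap_work_release(ap_work_t *w)
  {
      /* apr_atomic_dec32() returns zero when the count reaches zero,
       * so exactly one thread performs the cleanup. */
      if (apr_atomic_dec32(&w->refcount) == 0) {
          apr_brigade_destroy(w->bb);
          apr_pool_clear(w->ptrans);   /* recycle ptrans for reuse */
      }
  }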

Supporting ED/async reads is a bit more difficult.  There are two
fundamental problems:

* The functions that kick off network reads (ap_read_request and
ap_get_mime_headers_core) are written to expect a successful read and cannot
handle being reentered multiple times on the same request.  This can be
fixed relatively easily by implementing a simple state machine used by
ap_process_http_connection() and refactoring the code in ap_read_request and
ap_get_mime_headers to segregate the network reads from the code that
processes the received bytes. This work needs to be done for both event
driven reads and true async reads.

* ap_get_brigade and ap_bucket_read allocate buckets under the covers (deep
in the bucket code).  This is -really- bad for async I/O because async I/O
relies on the buffers passed to read not being freed out from under the read
(this is not an issue for an event driven read, I think). One solution is to
reimplement ap_get_brigade and the bucket read code to accept a buffer (and
len field) supplied by the application. This buffer would live in the
reference counted ap_work_t 'work element' mentioned earlier.
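
For the first problem, the state machine could be as simple as the
following sketch; all types, states, and helper functions here are
hypothetical:

  /* Sketch: each read event re-enters this function, which resumes
   * from the recorded state instead of assuming a complete read.
   */
  typedef enum {
      CONN_STATE_READ_REQUEST_LINE,
      CONN_STATE_READ_MIME_HEADERS,
      CONN_STATE_RUN_HANDLER,
      CONN_STATE_DONE
  } conn_state_e;

  typedef struct conn_ctx_t {
      conn_state_e state;
      /* partial-parse buffers, request_rec, etc. */
  } conn_ctx_t;

  /* Hypothetical helpers: return 0 if more bytes are still needed. */
  static int try_parse_request_line(conn_ctx_t *ctx);
  static int try_parse_mime_headers(conn_ctx_t *ctx);
  static void run_request_handler(conn_ctx_t *ctx);

  static void on_read_event(conn_ctx_t *ctx)
  {
      switch (ctx->state) {
      case CONN_STATE_READ_REQUEST_LINE:
          if (!try_parse_request_line(ctx))
              return;                  /* wait for the next read event */
          ctx->state = CONN_STATE_READ_MIME_HEADERS;
          /* fall through */
      case CONN_STATE_READ_MIME_HEADERS:
          if (!try_parse_mime_headers(ctx))
              return;
          ctx->state = CONN_STATE_RUN_HANDLER;
          /* fall through */
      case CONN_STATE_RUN_HANDLER:
          run_request_handler(ctx);    /* hand off to a worker thread */
          ctx->state = CONN_STATE_DONE;
          break;
      case CONN_STATE_DONE:
          break;
      }
  }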

I would like to see all reads done ED/async, handlers making the
decision on whether they want to do ED/async writes, and filters handling
APR_EAGAIN and APR_IO_PENDING correctly. Enabling writes to be
ED/async is a good first step.

Bill


Re: Bucket management strategies for async MPMs?

Posted by Brian Pane <br...@apache.org>.
Cliff Woolley wrote:

>On Sat, 31 Aug 2002, Brian Pane wrote:
>
>  
>
>>Wouldn't it be sufficient to guarantee that:
>> * each *bucket* can only be processed by one thread at a time, and
>> * allocating/freeing buckets is thread-safe?
>>    
>>
>
>No.  You'd need to also guarantee that all of the buckets sharing a
>private data structure (copies or splits of a single bucket) were, as a
>group, processed by only one thread at a time (and those buckets can exist
>across multiple brigades even).
>

I *think* this one can be solved by making the increment/decrement
of the bucket refcount atomic.
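
That is, something like the following on the shared-bucket refcount;
the atomic decrement (apr_atomic_dec32) and the change of the refcount
field to a 32-bit unsigned type are both assumptions:

  /* Sketch: atomic replacement for the unlocked decrement in
   * apr_bucket_shared_destroy().  Returns non-zero only for the thread
   * that drops the last reference, so exactly one thread frees the
   * shared resource.
   */
  #include "apr_atomic.h"
  #include "apr_buckets.h"

  static int shared_destroy_atomic(void *data)
  {
      apr_bucket_refcount *r = data;
      /* assumes refcount's type is changed to volatile apr_uint32_t */
      return apr_atomic_dec32((volatile apr_uint32_t *)&r->refcount) == 0;
  }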

>You'd also have to guarantee that no
>buckets are added/removed from a given brigade by more than one thread at
>a time.
>

This part is easy to guarantee.  When the worker thread passes
buckets to the writer thread, it hands off a whole brigade at once,
so that ownership of the brigade passes from one thread to another.

>  When you add up the implications of all these things, it
>basically ends up with the whole request being in one thread at a time.
>  
>

If we can overcome this limitation, it will be straightforward
to build an async MPM.  If not, the fallback solution would be:

  * Each worker thread does its own network writes, up until the
    point where it sees EOS.
  * At that point, the worker thread hands the remaining brigade
    off to the writer thread.  (In doing so, it's basically transferring
    the entire request to the writer thread.)

This would give us the benefits of async writes for static files,
where the core_output_filter could immediately transfer the
response_header+file_bucket+EOS brigade to the writer thread and
let the worker thread go on to work on other requests.  For large
streamed responses, though, the worker would end up writing almost
the entire response.

Brian



Re: Bucket management strategies for async MPMs?

Posted by Cliff Woolley <jw...@virginia.edu>.
On Sat, 31 Aug 2002, Brian Pane wrote:

> Wouldn't it be sufficient to guarantee that:
>  * each *bucket* can only be processed by one thread at a time, and
>  * allocating/freeing buckets is thread-safe?

No.  You'd need to also guarantee that all of the buckets sharing a
private data structure (copies or splits of a single bucket) were, as a
group, processed by only one thread at a time (and those buckets can exist
across multiple brigades even).  You'd also have to guarantee that no
buckets are added/removed from a given brigade by more than one thread at
a time.  When you add up the implications of all these things, it
basically ends up with the whole request being in one thread at a time.


Re: Bucket management strategies for async MPMs?

Posted by Brian Pane <br...@apache.org>.
Cliff Woolley wrote:

>On Sat, 31 Aug 2002, Brian Pane wrote:
>
>  
>
>>I don't think we can count on the assumption that each conn will
>>only be processed by one thread at a time.  For example, this race
>>    
>>
>
>Then we have to at least guarantee that each request can only be processed
>by one thread at a time, I think.  *None* of the buckets code is
>threadsafe, and it's done that way intentionally.  A brigade (and its
>allocator) can exist in exactly one thread at a time.
>  
>

Wouldn't it be sufficient to guarantee that:
 * each *bucket* can only be processed by one thread at a time, and
 * allocating/freeing buckets is thread-safe?

Brian



Re: Bucket management strategies for async MPMs?

Posted by Cliff Woolley <jw...@virginia.edu>.
On Sat, 31 Aug 2002, Brian Pane wrote:

> I don't think we can count on the assumption that each conn will
> only be processed by one thread at a time.  For example, this race

Then we have to at least guarantee that each request can only be processed
by one thread at a time, I think.  *None* of the buckets code is
threadsafe, and it's done that way intentionally.  A brigade (and its
allocator) can exist in exactly one thread at a time.

--Cliff


Re: Bucket management strategies for async MPMs?

Posted by Brian Pane <br...@apache.org>.
Cliff Woolley wrote:

>On Sat, 31 Aug 2002, Brian Pane wrote:
>
>  
>
>>  * The bucket allocator alloc/free code isn't
>>    thread-safe, so bad things will happen if the
>>    writer thread tries to free a bucket (that's
>>    just been written to the client) at the same
>>    time that a worker thread is allocating a new
>>    bucket for a subsequent request on the same
>>    connection.
>>    
>>
>
>We designed with this in mind... basically what's supposed to happen is
>that rather than having a bucket allocator per thread you have a group of
>available bucket allocators, and you assign one to each new connection.
>Since each connection will be processed by at most one thread at a time,
>you're safe.  When the connection is closed, the allocator is placed back
>into the list of available allocators for reuse on future connections.
>  
>

I don't think we can count on the assumption that each conn will
only be processed by one thread at a time.  For example, this race
condition can happen on a keepalive connection with pipelined
requests:

    1. Listener thread accepts the connection, obtains a bucket
       allocator, and assigns the allocator to the connection.
    2. Worker thread reads the first request from the connection.
       It's a simple file request, so the worker thread creates
       a 3-bucket brigade (response header, file bucket, EOS)
       and sends it to the writer thread.
    3. Writer thread starts the sendfile/sendfilev call.
    4. Worker thread reads the second request from the connection.
       This one is a CGI or proxy request that takes a long time
       and results in a long stream of buckets.
    5. Meanwhile, the sendfile call completes.  The writer thread
       deletes the file bucket...at the same time that the worker
       thread is allocating another heap bucket.
    6. Segfault, etc, etc.

In fact, this could even happen without keepalives, if a long-running
proxy request is still producing new buckets for a response while the
writer thread is handling buckets earlier in the same response.

Brian



Re: Bucket management strategies for async MPMs?

Posted by Cliff Woolley <jw...@virginia.edu>.
On Sat, 31 Aug 2002, Brian Pane wrote:

>   * The bucket allocator alloc/free code isn't
>     thread-safe, so bad things will happen if the
>     writer thread tries to free a bucket (that's
>     just been written to the client) at the same
>     time that a worker thread is allocating a new
>     bucket for a subsequent request on the same
>     connection.

We designed with this in mind... basically what's supposed to happen is
that rather than having a bucket allocator per thread you have a group of
available bucket allocators, and you assign one to each new connection.
Since each connection will be processed by at most one thread at a time,
you're safe.  When the connection is closed, the allocator is placed back
into the list of available allocators for reuse on future connections.

--Cliff

