Posted to dev@httpd.apache.org by Stefan Eissing <st...@greenbytes.de> on 2015/10/21 16:18:51 UTC

buckets and connections (long post)

(Sorry for the long post. It was helpful for myself to write it. If this does not
 hold your interest long enough, just ignore it please.)

As I understand it - and that understanding is incomplete - usual request processing looks like this:

A)
worker:
  conn <--- cfilter <--- rfilter
     |--b-b-b-b-b-b-b-b...

with buckets trickling to the connection through connection and request filters, state being
held on the stack of the assigned worker.

Once the filters are done, we have

B)
  conn 
     |--b-b-b-b-b...

just a connection with a bucket brigade yet to be written. This no longer needs a stack. The
worker can (depending on the mpm) be re-assigned to other tasks. Buckets are streamed out based
on io events (for example).

To go from A) to B), the connection needs to set aside its buckets, which is only real work for
certain bucket types. Transient buckets, for example, may point at data on the stack - which is
exactly what we need to free in order to reuse the worker.

This is beneficial when the work for setting buckets aside has much less impact on the system
than keeping the worker threads allocated. This is especially likely when slow clients are involved
that take ages to read a response.

In HTTP/1.1, a response is usually fully read by the client before it makes the next request. So,
for at least half of the round-trip time, the connection will be in state

C)
  conn 
     |-

without anything to read or write. But when the next request comes in, it gets assigned a worker and is
back in state A). Repeat until the connection closes.

Ok, so far?


How well does this mechanism work for mod_http2? In some respects it is the same, in others quite different.

On the real, main connection, the master connection, where the h2 session resides, things are
pretty similar with some exceptions:
- it is very bursty. Requests continue to come in; there is no pause between responses and the next request.
- pauses, when they happen, will be longer. clients are expected to keep open connections around for
  longer (if we let them).
- When there is nothing to do, mod_http2 makes a blocking read on the connection input. This currently
  does not lead to the state B) or C). The worker for the http2 connection stays assigned. This needs
  to improve.

On the virtual, slave connection, the one for HTTP/2 streams (i.e. requests), things are very different:
- the slave connection has a socket purely for appearances; there is no real network connection behind it.
- input/output events are signalled via condition variables and a mutex shared with the thread working on
  the main connection
- the "set-aside" happens, when output is transferred from the slave connection to the main one. The main
  connection allows a configurable number of maximum bytes buffered (or set-aside). Whenever the rest
  of the response fits into this buffer, the slave connection will be closed and the slave worker is
  reassigned. 
- Even better, when the response is a file bucket, the file handle is transferred, which is not counted 
  against the buffer limit (as it is just a handle). Therefore, static files are only looked up 
  by a slave connection; all I/O is done by the master thread.

So state A) is the same for slave connections. B) applies only insofar as the set-aside is replaced with the 
transfer of buckets to the master connection - which happens all the time. So slave connections are
just in A) or are gone; slave connections are not kept open.


This is the way it is implemented now. There may be other ways, but this is the way we have. If we
continue along this path, we have the following obstacles to overcome:
1. the master connection probably can play nicer with the MPM so that an idle connection uses less
   resources
2. The transfer of buckets from the slave to the master connection is a COPY except in case of
   file buckets (and there is a limit on that as well to not run out of handles).
   All other attempts at avoiding the copy failed. This may be a personal limitation of my APRbilities.
3. The amount of buffered bytes should be more flexible per stream and redistribute a maximum for 
   the whole session depending on load.
4. mod_http2 needs a process wide Resource Allocator for file handles. A master connection should
   borrow n handles at start, increase/decrease the amount based on load, to give best performance
5. similar optimizations should be possible for other bucket types (mmap? immortal? heap?)
6. pool buckets are very tricky to optimize, as pool creation/destroy is not thread-safe in general
   and it depends on how the parent pools and their allocators are set up. 
   Early hopes get easily crushed under load.
7. The buckets passed down on the master connection are using another buffer - when on https:// -
   to influence the SSL record sizes on write. Another COPY is not nice, but write performance
   is better this way. The ssl optimizations in place do not work for HTTP/2 as it has other
   bucket patterns. We should look if we can combine this into something without COPY, but with
   good sized SSL writes.


//Stefan



> Am 21.10.2015 um 00:20 schrieb Jim Jagielski <ji...@jaguNET.com>:
> 
> Sorry for not being on-list a bit lately... I've been a bit swamped.
> 
> Anyway, I too don't want httpd to go down the 'systemd' route: claim
> something as broken to explain, and justify, ripping it out
> and re-creating anew. Sometimes this needs be done, but not
> very often. And when stuff *is* broken, again, it's best to
> fix it than replace it (usually).


Re: buckets and connections (long post)

Posted by Graham Leggett <mi...@sharp.fm>.
On 22 Oct 2015, at 5:55 PM, Stefan Eissing <st...@greenbytes.de> wrote:

>> With the async filters this flow control is now made available to every filter in the ap_filter_setaside_brigade() function. When mod_http2 handles async filters you’ll get this flow control for free.
> 
> No, it will not. The processing of responses is very different.
> 
> Example: there is individual flow control of responses in HTTP/2. Clients do small window sizes on images, like 64KB in order to get small images completely or only the meta data of large ones. For these large files, the client does not send flow-control updates until it has received all other
> resources it is interested in. *Then* it tells the server to go ahead and send the rest of these images.
> 
> This means a file bucket for such images will hang around for an indefinite amount of time and a filter cannot say, "Oh, I have n file buckets queued, let's block write them first before I accept more." The server cannot do that.

What you’re describing is a DoS.

A client can’t tie up resources for an arbitrary amount of time, the server needs to be under control of this. If a client wants part of a file, the server needs to open the file, send the part, then close the file and be done. If the client wants more, then the server opens up the file again, sends more, and then is done.

> I certainly do not want to reinvent the wheel here and I am very glad about any existing solution and getting told how to use them. But please try to understand the specific problems before saying "we must have already a solution for that, go find it. you will see…"

The http2 code needs to fit in with the code that is already there, and most importantly it needs to ensure it doesn’t clash with the existing mechanisms. If an existing mechanism isn’t enough, it can be extended, but they must not be bypassed.

The mechanism in the core keeps track of the number of file buckets, in-memory buckets and requests “in flight”, and then blocks if this gets too high. Rather block and live to fight another day than try to open too many files and get spurious failures as you run out of file descriptors.

The async filters give you the ap_filter_should_yield() function, which will tell you if downstream is too full and you should hold off sending more data. For example, don’t accept another request if you’ve already got too many requests in flight.

Regards,
Graham
—


Re: buckets and connections (long post)

Posted by Stefan Eissing <st...@greenbytes.de>.
> Am 21.10.2015 um 16:48 schrieb Graham Leggett <mi...@sharp.fm>:
> 
> On 21 Oct 2015, at 4:18 PM, Stefan Eissing <st...@greenbytes.de> wrote:
> [...]
>> 3. The amount of buffered bytes should be more flexible per stream and redistribute a maximum for 
>>  the whole session depending on load.
>> 4. mod_http2 needs a process wide Resource Allocator for file handles. A master connection should
>>  borrow n handles at start, increase/decrease the amount based on load, to give best performance
>> 5. similar optimizations should be possible for other bucket types (mmap? immortal? heap?)
> 
> Right now this task is handled by the core network filter - it is very likely this problem is already solved, and you don’t need to do anything.
> 
> If the problem still needs solving, then the core filter is the place to do it. What the core filter does is add up the resources taken up by different buckets and if these resources breach limits, blocking writes are done until we’re below the limit again. This provides the flow control we need.

I know that code and it does not help HTTP/2 processing.

> With the async filters this flow control is now made available to every filter in the ap_filter_setaside_brigade() function. When mod_http2 handles async filters you’ll get this flow control for free.

No, it will not. The processing of responses is very different.

Example: there is individual flow control of responses in HTTP/2. Clients use small window sizes on images, like 64KB, in order to get small images completely but only the meta data of large ones. For these large files, the client does not send flow-control updates until it has received all the other
resources it is interested in. *Then* it tells the server to go ahead and send the rest of these images.

This means a file bucket for such images will hang around for an indefinite amount of time and a filter cannot say, "Oh, I have n file buckets queued, let's block write them first before I accept more." The server cannot do that.

I certainly do not want to reinvent the wheel here and I am very glad about any existing solution and getting told how to use them. But please try to understand the specific problems before saying "we must have already a solution for that, go find it. you will see..."

//Stefan



Re: buckets and connections (long post)

Posted by Graham Leggett <mi...@sharp.fm>.
On 22 Oct 2015, at 6:03 PM, Stefan Eissing <st...@greenbytes.de> wrote:

> This is all true and correct - as long as all this happens in a single thread. If you have multiple threads and create sub-pools for each from a main pool, each and every create and destroy of these sub-pools, plus any action on the main pool, must be mutex-protected. As I found out.

Normally if you’ve created a thread from a main pool, you need to create a pool cleanup for that thread off the main pool that is registered with apr_pool_pre_cleanup_register(). In this cleanup, you signal the thread to shut down gracefully and then apr_thread_join to wait for the thread to shut down, then the rest of the pool can be cleaned up.

The “pre” is key to this - the cleanup must run before the subpool is cleared.

> Similar with buckets. When you create a bucket in one thread, you may not destroy it in another - *while* the bucket_allocator is being used. bucket_allocators are not thread-safe, which means bucket_brigades are not, which means that all buckets from the same brigade must only be used inside a single thread.

“…inside a single thread at a time”.

The event MPM is an example of this in action.

A connection is handled by an arbitrary thread until that connection must poll. At that point it goes back into the pool of connections, and when ready is given to another arbitrary thread. In this case the threads are handled “above” the connections, so the destruction of a connection doesn’t impact a thread.

> This means for example that, even though mod_http2 manages the pool lifetime correctly, it cannot pass a response bucket from a request pool in thread A for writing onto the  main connection in thread B, *as long as* the response is not complete and thread A is still producing more buckets with the same allocator. etc. etc.
> 
> That is what I mean with not-thread-safe.

In this case you have different allocators, and so must pass the buckets over.

Remember that being lock free is a feature, not a bug. As soon as you add mutexes you add delay and slow everything down, because the world must stop until the condition is fulfilled.

A more efficient way of handling this is to use some kind of IPC so that the requests signal the master connection and go “I’ve got data for you”, after which the requests don’t touch that data until the master has said “I’ve got it, feel free to send more”. That IPC could be a series of mutexes, or a socket of some kind. Anything that gets rid of a global lock.

That doesn’t mean request processing must stop dead, that request just gets put aside and that thread is free to work on another request.

I’m basically describing the event MPM.

Regards,
Graham
—


Re: buckets and connections (long post)

Posted by Stefan Eissing <st...@greenbytes.de>.
> Am 21.10.2015 um 16:48 schrieb Graham Leggett <mi...@sharp.fm>:
> 
> On 21 Oct 2015, at 4:18 PM, Stefan Eissing <st...@greenbytes.de> wrote:
>> 6. pool buckets are very tricky to optimize, as pool creation/destroy is not thread-safe in general
>>  and it depends on how the parent pools and their allocators are set up. 
>>  Early hopes get easily crushed under load.
> 
> As soon as I see “buckets aren’t thread safe” I read it as “buckets are being misused” or “pool lifetimes are being mixed up”.
> 
> Buckets arise from allocators, and you must never try to add a bucket from one allocator into a brigade sourced from another allocator. For example, if you have a bucket allocated from the slave connection, you need to copy it into a different bucket allocated from the master connection before trying to add it to a master brigade.
> 
> Buckets are also allocated from pools, and pools have different lifetimes depending on what they were created for. If you allocate a bucket from the request pool, that bucket will vanish when the request pool is destroyed. Buckets can be passed from one pool to another, that is what “setaside” means.
> 
> It is really important to get the pool lifetimes right. Allocate something accidentally from the master connection pool on a slave connection and it appears to work, because generally the master outlives the slave. Until the master is cleaned up first, and suddenly memory vanishes unexpectedly in the slave connections - and you crash.
> 
> There were a number of subtle bugs in the proxy where buckets had been allocated from the wrong pool, and all sorts of weirdness ensued. Make sure your pool lifetimes are handled correctly and it will work.

This is all true and correct - as long as all this happens in a single thread. If you have multiple threads and create sub-pools for each from a main pool, each and every create and destroy of these sub-pools, plus any action on the main pool, must be mutex-protected. As I found out.

It is similar with buckets. When you create a bucket in one thread, you may not destroy it in another - *while* the bucket_allocator is being used. bucket_allocators are not thread-safe, which means bucket_brigades are not, which means that all buckets from the same brigade must only be used inside a single thread.

This means, for example, that even though mod_http2 manages the pool lifetime correctly, it cannot pass a response bucket from a request pool in thread A for writing onto the main connection in thread B, *as long as* the response is not complete and thread A is still producing more buckets with the same allocator. Etc. etc.

That is what I mean with not-thread-safe.

//Stefan

Re: buckets and connections (long post)

Posted by Graham Leggett <mi...@sharp.fm>.
On 22 Oct 2015, at 6:04 PM, Stefan Eissing <st...@greenbytes.de> wrote:

>> mod_ssl already worries about buffering on its own, there is no need to recreate this functionality. Was this not working?
> 
> As I wrote "it has other bucket patterns", which do not get optimized by the coalescing filter of mod_ssl.

Then we must fix the coalescing filter in mod_ssl.

Regards,
Graham
—


Re: buckets and connections (long post)

Posted by Stefan Eissing <st...@greenbytes.de>.
> Am 21.10.2015 um 16:48 schrieb Graham Leggett <mi...@sharp.fm>:
> 
> On 21 Oct 2015, at 4:18 PM, Stefan Eissing <st...@greenbytes.de> wrote:
> 
>> 7. The buckets passed down on the master connection are using another buffer - when on https:// -
>>  to influence the SSL record sizes on write. Another COPY is not nice, but write performance
>>  is better this way. The ssl optimizations in place do not work for HTTP/2 as it has other
>>  bucket patterns. We should look if we can combine this into something without COPY, but with
>>  good sized SSL writes.
> 
> mod_ssl already worries about buffering on its own, there is no need to recreate this functionality. Was this not working?

As I wrote "it has other bucket patterns", which do not get optimized by the coalescing filter of mod_ssl.

//Stefan

Re: buckets and connections (long post)

Posted by Graham Leggett <mi...@sharp.fm>.
On 22 Oct 2015, at 5:43 PM, Stefan Eissing <st...@greenbytes.de> wrote:

>> The blocking read breaks the spirit of the event MPM.
>> 
>> In theory, as long as you enter the write completion state and then not leave until your connection is done, this problem will go away.
>> 
>> If you want to read instead of write, make sure the CONN_SENSE_WANT_READ option is set on the connection.
> 
> This does not parse. I do not understand what you are talking about. 
> 
> When all streams have been passed into the output filters, the mod_http2 session does a 
> 
>    status = ap_get_brigade(io->connection->input_filters,...)  (h2_conn_io.c, line 160)
> 
> similar to what ap_read_request() -> ap_rgetline_core() does. (protocol.c, line 236)
> 
> What should mod_http2 do different here?

What ap_read_request does is:

- a) read the request (parse)
- b) handle the request (make decisions on what to do, internally redirect, rewrite, etc etc)
- c) exit, and let the MPM complete the request in the write_completion phase.

What you want to do is move the request completion into a filter, like mod_cache does. You start by setting up your request, you parse headers, you do the HTTP2 equivalent of ap_read_request(), then you do the actual work inside a filter. Look at the CACHE_OUT and CACHE_SAVE filters as examples.

To be more specific, in the handler that detects HTTP/2 you add a filter that processes the data, then write an EOS bucket to kick off the process and leave. The filter takes over.

The reason for this is that you want to escape the handler phase as soon as possible, and leave the MPM to do its work.

Regards,
Graham
—




Re: buckets and connections (long post)

Posted by Stefan Eissing <st...@greenbytes.de>.
(I split these up, since answers touch on different topics):



> Am 21.10.2015 um 16:48 schrieb Graham Leggett <mi...@sharp.fm>:
> 
> On 21 Oct 2015, at 4:18 PM, Stefan Eissing <st...@greenbytes.de> wrote:
> 
>> How well does this mechanism work for mod_http2? In some respects it is the same, in others quite different.
>> 
>> On the real, main connection, the master connection, where the h2 session resides, things are
>> pretty similar with some exceptions:
>> - it is very bursty. requests continue to come in. There is no pause between responses and the next request.
>> - pauses, when they happen, will be longer. clients are expected to keep open connections around for
>> longer (if we let them).
>> - When there is nothing to do, mod_http2 makes a blocking read on the connection input. This currently
>> does not lead to the state B) or C). The worker for the http2 connection stays assigned. This needs
>> to improve.
> 
> The blocking read breaks the spirit of the event MPM.
> 
> In theory, as long as you enter the write completion state and then not leave until your connection is done, this problem will go away.
> 
> If you want to read instead of write, make sure the CONN_SENSE_WANT_READ option is set on the connection.

This does not parse. I do not understand what you are talking about. 

When all streams have been passed into the output filters, the mod_http2 session does a 

    status = ap_get_brigade(io->connection->input_filters,...)  (h2_conn_io.c, line 160)

similar to what ap_read_request() -> ap_rgetline_core() does. (protocol.c, line 236)

What should mod_http2 do different here?

//Stefan

Re: buckets and connections (long post)

Posted by Graham Leggett <mi...@sharp.fm>.
On 21 Oct 2015, at 4:18 PM, Stefan Eissing <st...@greenbytes.de> wrote:

> How well does this mechanism work for mod_http2? In some respects it is the same, in others quite different.
> 
> On the real, main connection, the master connection, where the h2 session resides, things are
> pretty similar with some exceptions:
> - it is very bursty. requests continue to come in. There is no pause between responses and the next request.
> - pauses, when they happen, will be longer. clients are expected to keep open connections around for
>  longer (if we let them).
> - When there is nothing to do, mod_http2 makes a blocking read on the connection input. This currently
>  does not lead to the state B) or C). The worker for the http2 connection stays assigned. This needs
>  to improve.

The blocking read breaks the spirit of the event MPM.

In theory, as long as you enter the write completion state and then not leave until your connection is done, this problem will go away.

If you want to read instead of write, make sure the CONN_SENSE_WANT_READ option is set on the connection.

(You may find reasons that stop this working, if so, these need to be isolated and fixed).

> This is the way it is implemented now. There may be other ways, but this is the way we have. If we
> continue along this path, we have the following obstacles to overcome:
> 1. the master connection probably can play nicer with the MPM so that an idle connection uses less
>   resources
> 2. The transfer of buckets from the slave to the master connection is a COPY except in case of
>   file buckets (and there is a limit on that as well to not run out of handles).
>   All other attempts at avoiding the copy, failed. This may be a personal limitation of my APRbilities.

This is how the proxy does it.

Buckets owned by the backend conn_rec are copied and added to the frontend conn_rec.

> 3. The amount of buffered bytes should be more flexible per stream and redistribute a maximum for 
>   the whole session depending on load.
> 4. mod_http2 needs a process wide Resource Allocator for file handles. A master connection should
>   borrow n handles at start, increase/decrease the amount based on load, to give best performance
> 5. similar optimizations should be possible for other bucket types (mmap? immortal? heap?)

Right now this task is handled by the core network filter - it is very likely this problem is already solved, and you don’t need to do anything.

If the problem still needs solving, then the core filter is the place to do it. What the core filter does is add up the resources taken up by different buckets and if these resources breach limits, blocking writes are done until we’re below the limit again. This provides the flow control we need.

With the async filters this flow control is now made available to every filter in the ap_filter_setaside_brigade() function. When mod_http2 handles async filters you’ll get this flow control for free.

> 6. pool buckets are very tricky to optimize, as pool creation/destroy is not thread-safe in general
>   and it depends on how the parent pools and their allocators are set up. 
>   Early hopes get easily crushed under load.

As soon as I see “buckets aren’t thread safe” I read it as “buckets are being misused” or “pool lifetimes are being mixed up”.

Buckets arise from allocators, and you must never try to add a bucket from one allocator into a brigade sourced from another allocator. For example, if you have a bucket allocated from the slave connection, you need to copy it into a different bucket allocated from the master connection before trying to add it to a master brigade.

Buckets are also allocated from pools, and pools have different lifetimes depending on what they were created for. If you allocate a bucket from the request pool, that bucket will vanish when the request pool is destroyed. Buckets can be passed from one pool to another, that is what “setaside” means.

It is really important to get the pool lifetimes right. Allocate something accidentally from the master connection pool on a slave connection and it appears to work, because generally the master outlives the slave. Until the master is cleaned up first, and suddenly memory vanishes unexpectedly in the slave connections - and you crash.

There were a number of subtle bugs in the proxy where buckets had been allocated from the wrong pool, and all sorts of weirdness ensued. Make sure your pool lifetimes are handled correctly and it will work.

> 7. The buckets passed down on the master connection are using another buffer - when on https:// -
>   to influence the SSL record sizes on write. Another COPY is not nice, but write performance
>   is better this way. The ssl optimizations in place do not work for HTTP/2 as it has other
>   bucket patterns. We should look if we can combine this into something without COPY, but with
>   good sized SSL writes.

mod_ssl already worries about buffering on its own, there is no need to recreate this functionality. Was this not working?

Regards,
Graham
—