You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@apr.apache.org by Brian Pane <bp...@pacbell.net> on 2001/07/11 03:54:54 UTC

Hash tables and request headers Re: observations on fragmentation in SMS pools

Roy T. Fielding wrote:

>>Another reason why apr_hash_t isn't a good match for HTTP headers
>>is that the same field name may appear multiple times in the request
>>or response headers (section 4.2 of RFC 2616), but the hash table
>>implementation is designed around unique keys.
>>
>
>HTTP headers were designed to be processed as a hash table, hence the
>rules regarding combining multiple fields with the same name.  The only
>exception is Set-Cookie, because the folks at Netscape didn't follow
>the rules.
>
Ah, I was wondering why RFC 2109 seemed at odds with 2616 on the
subject of field-combining syntax.  :-(

If Set-Cookie is the only exception, it might not be a bad idea to
use apr_hash_t for the request and response headers.  I'm even willing
to contribute a patch for this, but it would impact both APR (slightly)
and Apache (a lot).  Here's my proposed approach:

  1. Add a 2nd create function for apr_hash_t that lets the caller
     supply three callback functions:
     - a hash function
     - an equality-check function
     - a concatenation function (if non-NULL, setting x=y followed by
       x=y results in concat_fn(y,z) being stored in the table as the
       value for x)
  2. Change the headers_in and headers_out fields of the request_req
     from tables to hash tables.  Supply case-insensitive hash and
     comparison functions for these hash tables.  For headers_out, also
     supply a concatenation function that builds comma-separated lists
     (and semicolon-separated lists when the key is set-cookie)
  3. Change all of the references to headers_in and headers_out in
     the core and standard modules to use the hash API instead of
     the table API.
  4. Document the change so that maintainers of 3rd party modules
     know how to make their code compile once again.

Thoughts?  Is a change like this worth doing?

--Brian




Re: Hash tables and request headers Re: observations on fragmentation in SMS pools

Posted by dean gaudet <de...@arctic.org>.
On Tue, 10 Jul 2001, Brian Pane wrote:

>   1. Add a 2nd create function for apr_hash_t that lets the caller
>      supply three callback functions:
>      - a hash function
>      - an equality-check function
>      - a concatenation function (if non-NULL, setting x=y followed by
>        x=y results in concat_fn(y,z) being stored in the table as the
>        value for x)

if you do concatenation this way you'll end up back in the land of O(n^2)
DOS attacks which we fixed years ago... this is why the merging is done
with those odd looking sort functions.  (it's in the archives if you want
to dig it all up, there were a lot of O(n^2) attacks :)

see get_mime_headers and apr_table_overlap.

i'm not sure it's appropriate to use generic interfaces with function
pointer indirections when you're trying to make something go faster.  if
you want it go fast, then don't make it generic.  make it specific.

(aside on modern cpus:  indirect function calls with multiple targets
hurt... they hurt because the CPU has difficulty predicting where the
branch/call target will be, so it has difficulty pre-fetching and decoding
across the branch/call.  this typically results in stalls in the pipeline
-- the number of stalls depends on how many stages from the front of the
pipeline the branch execute phase is.  some CPUs do just fine if your
indirect function call always has the same target -- but as soon as you
add a second target, this predictor will break.)

there's a lot of compiler theory which is appropriate here -- the HTTP
header names themselves are very similar to keywords in a language.
typically a tool such as gperf is used to map keywords to an enumerated
type when writing (fast) parsers for languages.  this would be a great
thing for a webserver other than apache, apache can't use it directly
because we're so general we allow people to modify the set of
recognized/generated HTTP headers at run-time.

i think ultimately what you want to be able to do is the following:

- for all common HTTP headers we've got a compile-time perfect hash
  mapping keyword <-> header_enum

- get_mime_headers uses that hash and stores the headers in a
  structure mapping header_enum -> header value

- r->headers_{in,out,err} manipulations use the integer values rather
  than the string names  (this allows table lookups simply by comparing
  integers rather than strings)

- an additional mechanism exists to add to the enum at run-time

the latter part is a pain in the ass.  (but it's pretty typical of apache
perf problems :)

initially at least, you might as well do the dynamic part using the
existing tables... so you've got a hybrid structure which supports quick
lookups for the core headers, and the same old same old for the rest.

do this just to find out how much of a perf benefit you're getting from
the perfect hash.  (hell just ignore the dynamic header crud for perf
testing.)

but you'll probably get resistance to putting this into the server for
good... 'cause if a module uses a "non-standard" header "BlahBlah" and
then rfc3942 HTTP/1.2 makes BlahBlah standard, and it's moved into the
perfect hash, then the module would be broken.

(i've got other ideas, but it's senseless to think about them until some
sort of perf improvement is shown.)

-dean




Re: Hash tables and request headers Re: observations on fragmentation in SMS pools

Posted by dean gaudet <de...@arctic.org>.
On Tue, 10 Jul 2001, Brian Pane wrote:

>   1. Add a 2nd create function for apr_hash_t that lets the caller
>      supply three callback functions:
>      - a hash function
>      - an equality-check function
>      - a concatenation function (if non-NULL, setting x=y followed by
>        x=y results in concat_fn(y,z) being stored in the table as the
>        value for x)

if you do concatenation this way you'll end up back in the land of O(n^2)
DOS attacks which we fixed years ago... this is why the merging is done
with those odd looking sort functions.  (it's in the archives if you want
to dig it all up, there were a lot of O(n^2) attacks :)

see get_mime_headers and apr_table_overlap.

i'm not sure it's appropriate to use generic interfaces with function
pointer indirections when you're trying to make something go faster.  if
you want it go fast, then don't make it generic.  make it specific.

(aside on modern cpus:  indirect function calls with multiple targets
hurt... they hurt because the CPU has difficulty predicting where the
branch/call target will be, so it has difficulty pre-fetching and decoding
across the branch/call.  this typically results in stalls in the pipeline
-- the number of stalls depends on how many stages from the front of the
pipeline the branch execute phase is.  some CPUs do just fine if your
indirect function call always has the same target -- but as soon as you
add a second target, this predictor will break.)

there's a lot of compiler theory which is appropriate here -- the HTTP
header names themselves are very similar to keywords in a language.
typically a tool such as gperf is used to map keywords to an enumerated
type when writing (fast) parsers for languages.  this would be a great
thing for a webserver other than apache, apache can't use it directly
because we're so general we allow people to modify the set of
recognized/generated HTTP headers at run-time.

i think ultimately what you want to be able to do is the following:

- for all common HTTP headers we've got a compile-time perfect hash
  mapping keyword <-> header_enum

- get_mime_headers uses that hash and stores the headers in a
  structure mapping header_enum -> header value

- r->headers_{in,out,err} manipulations use the integer values rather
  than the string names  (this allows table lookups simply by comparing
  integers rather than strings)

- an additional mechanism exists to add to the enum at run-time

the latter part is a pain in the ass.  (but it's pretty typical of apache
perf problems :)

initially at least, you might as well do the dynamic part using the
existing tables... so you've got a hybrid structure which supports quick
lookups for the core headers, and the same old same old for the rest.

do this just to find out how much of a perf benefit you're getting from
the perfect hash.  (hell just ignore the dynamic header crud for perf
testing.)

but you'll probably get resistance to putting this into the server for
good... 'cause if a module uses a "non-standard" header "BlahBlah" and
then rfc3942 HTTP/1.2 makes BlahBlah standard, and it's moved into the
perfect hash, then the module would be broken.

(i've got other ideas, but it's senseless to think about them until some
sort of perf improvement is shown.)

-dean