Posted to dev@httpd.apache.org by Eric Prud'hommeaux <er...@w3.org> on 2003/07/24 19:56:41 UTC
Re: Apache 2.0 mod_cache

On Wed, Jul 23, 2003 at 03:52:43PM -0700, James B Robinson wrote:
> Yo, Eric
> 
> Imagine my surprise while perusing the changelogs for Apache to run across:
> 
>   *) mod_disk_cache works much better. This module should still
>      be considered experimental. [Eric Prud'hommeaux]
> 
> I've just started trying to get either squid or apache to do mem caching
> on a half dozen 4G linux boxes and I'm being a bit frustrated by Apache's
> lack of reporting what is up in the cache. You know much about mod_cache,
> mod_mem_cache or know of people who do?

I was working on some packages that interact with the Vary header and
found that disk_cache wasn't paying correct attention to it anyways.
I hacked (at) them a bit and left them in a state where they could do
relatively simple caching operations. There should be no false hits
but there are opportunities for false misses (entities that could have
been served from cache but, due to proxy naivete, weren't). The miss
scenario is as follows:

-CACHE MISS BUG SCENARIO-
C1: GET path1 HTTP/1.1
    Accept: text/html;q=0.5,application/soap+xml;q=1.0
    Accept-Language: fr;q=0.8, en;q=0.7
    Accept-Charset: iso-8859-5, unicode-1-1;q=0.8
    Accept-Encoding: gzip;q=1.0, identity; q=0.5, *;q=0
    Foo: bar
proxy passes request to document server (or upstream proxy) and gets
back S1:
    200 OK
    Vary: Accept,Accept-Language
    Expires: Wed, 31 Dec 2003 16:00:00 GMT
    A: b

    ...data...
and dutifully records the entity along with all of the headers in C1
that were listed in S1.Vary. It stores them in a spot on the disk
computed by the hash of path1.
hash(path1):
    Accept: text/html;q=0.5,application/soap+xml;q=1.0
    Accept-Language: fr;q=0.8, en;q=0.7
    ...S1 data...

A subsequent request comes in for path1. Lucky case: it has the same
headers and the proxy can match them against what was written at
hash(path1). It may have a different charset and encoding, as they
were not listed in the Vary header.

Another request comes in with a different Accept-Language header:
C2: GET path1 HTTP/1.1
    Accept: text/html;q=0.5,application/soap+xml;q=1.0
    Accept-Language: esparanto,iso-latin-pig

A false hit would be if the proxy said "why, I've got one of those"
and gave back the cached entity. I think I made sure that won't happen.
But I believe the proxy will replace the previously cached entity with
what comes back from S2 (upstream response to C2).
hash(path1):
    Accept: text/html;q=0.5,application/soap+xml;q=1.0
    Accept-Language: esparanto,iso-latin-pig
    ...S2 data...

-APPRAISAL OF CURRENT CODE-
The false miss comes when another request like C1 comes in. The proxy
no longer maintains the response S1 so it gets a cache miss and sends
a request upstream. The resulting inefficiency can be estimated by
observing the traffic coming through and seeing if the clients are
sending requests that vary by something listed in the response's Vary
header, and that they are doing this more quickly than the entity
would expire of natural causes.
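
The scenario above can be modelled in a few lines of Python. This is a
minimal sketch, not the mod_disk_cache code: the class SingleSlotCache,
its methods, and the exact-string header comparison are assumptions made
for illustration (real header names are case-insensitive and real
matching is looser).

```python
import hashlib


def parse_vary(value):
    """Split a Vary header value into a list of header names."""
    return [name.strip() for name in value.split(",") if name.strip()]


class SingleSlotCache:
    """Hypothetical model: one cache slot per path, keyed by hash(path)."""

    def __init__(self):
        self.slots = {}

    @staticmethod
    def _key(path):
        return hashlib.sha256(path.encode("utf-8")).hexdigest()

    def store(self, path, req_headers, vary, body):
        # Record only the request headers named in the response's Vary list.
        selected = {name: req_headers.get(name) for name in parse_vary(vary)}
        # This overwrites whatever variant was stored for the path before.
        self.slots[self._key(path)] = (selected, body)

    def lookup(self, path, req_headers):
        entry = self.slots.get(self._key(path))
        if entry is None:
            return None
        selected, body = entry
        # Hit only if every recorded header matches exactly (no false hits).
        if all(req_headers.get(n) == v for n, v in selected.items()):
            return body
        return None


c1 = {"Accept": "text/html;q=0.5,application/soap+xml;q=1.0",
      "Accept-Language": "fr;q=0.8, en;q=0.7"}
c2 = {"Accept": "text/html;q=0.5,application/soap+xml;q=1.0",
      "Accept-Language": "esparanto,iso-latin-pig"}

cache = SingleSlotCache()
cache.store("path1", c1, "Accept,Accept-Language", "S1 data")
hit_before = cache.lookup("path1", c1)        # served from cache
cache.store("path1", c2, "Accept,Accept-Language", "S2 data")
false_miss = cache.lookup("path1", c1) is None  # S1 was overwritten
```

After the second store, a request like C1 misses even though S1 had not
expired, which is the false miss described above.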

I believe that the current implementation is well worth its weight in
CPU time and maintenance. On the maintenance front, I don't believe
mod_disk_cache cleans up after itself, but a simple find of files with
an access time older than some interval will give you a nice least-
recently-used algorithm. Or you can give it its own filesystem and let
it bump its head and clean up whenever you feel like it.
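
The access-time cleanup could look roughly like the following Python
sketch. The function name expire_by_atime and its parameters are
hypothetical, and it assumes the cache filesystem actually records
access times (a noatime or relatime mount makes st_atime unreliable for
this purpose).

```python
import os
import time


def expire_by_atime(cache_root, max_idle_seconds, now=None):
    """Remove files under cache_root not accessed within max_idle_seconds.

    Returns the list of removed paths.
    """
    now = time.time() if now is None else now
    removed = []
    for dirpath, _dirnames, filenames in os.walk(cache_root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                idle = now - os.stat(full).st_atime
                if idle > max_idle_seconds:
                    os.remove(full)
                    removed.append(full)
            except FileNotFoundError:
                # Another process removed the file first; nothing to do.
                pass
    return removed
```

Run periodically, for example with max_idle_seconds set to a day, this
approximates least-recently-used eviction without the cache module
having to track usage itself.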

The CPU involved in a cache miss is pretty minimal: a couple entries
into a module, computing a hash, and a file open failure. The CPU
involved in a cache hit would have to be cracking large keys or
searching for aliens before it would be comparable with the time to
have sent the request to a distant server.

So, we have a working system, but I think it could be improved easily:

-PROPOSED FIX-
The inefficiency comes from storing the cache entry at a hash
calculated only by the request path. If the varied headers were added
to that hash, we would have a place to store all the variations of the
entity. But, we'd have to be clairvoyant to know which of the request
headers that came in would be needed to calculate the hash. To find
this, I believe hash(path) needs to contain a list of the Vary headers
for the server response(s), i.e., after the two requests above, the proxy
would have

-PROPOSAL P1-
hash(path1):
    Accept
    Accept-Language

hash(path1 . Accept: ... . A-L: fr;q=0.8, en;q=0.7):
    Accept: text/html;q=0.5,application/soap+xml;q=1.0
    Accept-Language: fr;q=0.8, en;q=0.7
    ...S1 data...

hash(path1 . Accept: ... . A-L: esparanto,iso-latin-pig):
    Accept: text/html;q=0.5,application/soap+xml;q=1.0
    Accept-Language: esparanto,iso-latin-pig
    ...S2 data...
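
A minimal in-memory sketch of P1 in Python. The class VaryIndexedCache
and the helper variant_key are hypothetical names, and the two
dictionaries stand in for what would be a single on-disk hash namespace;
header values are compared as exact strings, which is a simplification.

```python
import hashlib


def _h(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def variant_key(path, vary_names, req_headers):
    """Hash the path together with the request headers named in Vary."""
    parts = [path]
    for name in vary_names:
        parts.append(f"{name}: {req_headers.get(name, '')}")
    return _h("\n".join(parts))


class VaryIndexedCache:
    """Hypothetical model of P1: a Vary list per path plus one entry per variant."""

    def __init__(self):
        self.vary_index = {}  # hash(path) -> list of Vary header names
        self.variants = {}    # hash(path . varied headers) -> body

    def store(self, path, req_headers, vary, body):
        names = [n.strip() for n in vary.split(",") if n.strip()]
        self.vary_index[_h(path)] = names
        self.variants[variant_key(path, names, req_headers)] = body

    def lookup(self, path, req_headers):
        # First lookup: which request headers select the variant?
        names = self.vary_index.get(_h(path))
        if names is None:
            return None
        # Second lookup: the variant for this request's header values.
        return self.variants.get(variant_key(path, names, req_headers))


accept = "text/html;q=0.5,application/soap+xml;q=1.0"
c1 = {"Accept": accept, "Accept-Language": "fr;q=0.8, en;q=0.7"}
c2 = {"Accept": accept, "Accept-Language": "esparanto,iso-latin-pig"}

p1_cache = VaryIndexedCache()
p1_cache.store("path1", c1, "Accept,Accept-Language", "S1 data")
p1_cache.store("path1", c2, "Accept,Accept-Language", "S2 data")
```

Both variants now coexist, so a later request like C1 is still a hit.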

-PROPOSAL P1a-
Alternatively, hash(path1) could compute a directory name. The Vary
list could be stored in hash(path1)/vary and the other documents could
be stored in entries like 
hash(path1)/hash(Accept: ... . A-L: esparanto,iso-latin-pig). It should
be easy to ensure that the name produced by hashing the varied headers
would never collide with the special filename "vary".

I'd like some feedback on the comparative costs of two files vs. a
directory with two files in it as this is the most common case for a
non-varied request. The cost would appear to be higher as we are
adding a directory in P1a, but that may help break up large
directories. But I don't know filesystems. Who does?

-ROCKS TO BE THROWN-

This solution assumes that the Vary header will be constant for a
given path. HTTP does not make this promise, so the cache module will
need to rewrite the Vary header list if it was different from the Vary
header of any response it receives from upstream. We could solve this
problem by...

-PROPOSAL P1a1-
...walking the directory in P1a (above) to look for the first
one that has headers all matching the current cache candidate. This
elides the scenario where Vary headers are seemingly inconsistent:

R3: GET /path1 HTTP/1.1
    Accept: text/html;q=0.5,application/soap+xml;q=1.0
S3: 200 OK
    Vary: Accept
R4: GET /path1 HTTP/1.1
    Accept: text/html;q=0.5,application/soap+xml;q=1.0
    Accept-Language: fr;q=0.8, en;q=0.7
S4: 200 OK
    Vary: Accept,Accept-Language

but that's just messed up anyways.
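
One possible reading of P1a1, sketched in Python: each stored variant
carries its own selecting headers, and lookup scans the variants for a
path in order. The function find_variant and the list-of-pairs
representation are hypothetical stand-ins for walking the directory.

```python
def find_variant(variants, req_headers):
    """Return the body of the first variant whose stored headers all match.

    variants is a list of (stored_headers, body) pairs, standing in for
    the files found while walking the directory for one path.
    """
    for stored_headers, body in variants:
        if all(req_headers.get(n) == v for n, v in stored_headers.items()):
            return body
    return None


accept = "text/html;q=0.5,application/soap+xml;q=1.0"
variants = [
    # Stored from the exchange whose response said "Vary: Accept".
    ({"Accept": accept}, "first response"),
    # Stored from the exchange whose response also varied on language.
    ({"Accept": accept, "Accept-Language": "fr;q=0.8, en;q=0.7"},
     "second response"),
]

request = {"Accept": accept, "Accept-Language": "fr;q=0.8, en;q=0.7"}
chosen = find_variant(variants, request)
```

With the inconsistent Vary lists above, the request matches both stored
variants and the walk order decides which one is served, so the result
depends on directory ordering.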

Whose baby is disk cache now anyways?
`cvs log modules/experimental/mod_disk_cache.c` shows brianp doing a
bit of protocol-level hacking there. My last patch was
[[
date: 2002/08/18 12:33:05;  author: stoddard;  state: Exp;  lines: +40 -2
Get mod_disk_cache working.

Submitted by: Eric Prud'hommeaux
Reviewes by: Paul Reder, Bill Stoddard
]]

I'd like to hear from folks about the proposals above (P1, P1a, P1a1
and son of the return of proposal P1a1a1a strikes back in 3D) and the
filesystem metrics.
Also, is anyone here using disk_cache? It would be cool to make this
a nice showpiece for how HTTP caching is supposed to work.
-- 
-eric

office: +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +1.857.222.5741

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.