Posted to modperl@perl.apache.org by Dan McCormick <da...@metro.net> on 2000/11/16 07:37:41 UTC

mod_proxy caching documentation

Hi,

After struggling to figure out mod_proxy's caching algorithm
and noting from the list archives that others had, too -- and due to
the dearth of existing documentation on the subject -- I came up with
some documentation below by sifting through the source code.  Most of it
isn't explicitly mod_perl-related, but I hope those trying to set it up
will find it useful.  Included at the end is a Perl script to determine
the filename that mod_proxy uses to cache files, which is helpful in
manually cleaning up the cache.  If anyone has comments or 
suggestions, please let me know.

Thanks,
Dan

------------------------------------

Setting up Apache with mod_proxy to cache content from a mod_perl server

The documentation for mod_proxy can be found at
http://httpd.apache.org/docs/mod/mod_proxy.html.  Unfortunately, aside
from the configuration parameters, not much detail is provided on how to
set up mod_proxy to cache pages from a downstream server.  This
explanation hopes to fill that void.  Most of its content was derived by
going through the mod_proxy.c, proxy_cache.c, and proxy_util.c source
files and comments in the src/modules/proxy directory of the Apache
1.3.12 distribution.

* The Short Story

In short, mod_proxy will cache all responses that contain a Last-Modified
header and an Expires header.  You can add these headers in your mod_perl
scripts with something like this:

use Apache::File ();
use HTTP::Date;

# see Eagle book p. 493 for explanation
$r->set_last_modified((stat $r->finfo)[9]);
# expires in one day
$r->header_out('Expires', HTTP::Date::time2str(time + 24*60*60));
                            
The page will live in the cache until the current time passes the time
defined by the Expires header or the time since the file was cached
exceeds the CacheMaxExpire parameter as set in the server config file.

* The Long Story

To understand how the caching proxy server works, let's trace the flow
of two simple HTTP exchanges for the same file, from the browser request
to the returned page.

- The browser makes a request to the proxy server like this:

GET /index.html HTTP/1.0

- The proxy server takes the URL and converts it to a filename on your
filesystem.  This filename bears no resemblance to the actual URL.
Instead, it is derived from an MD5 hash of the fully qualified URL of the
document (e.g. http://www.myserver.com:80/mypage.html) and is broken up
into a number of directory levels, as defined by the CacheDirLevels
parameter in the config file.  (WHY DOES IT MATTER HOW MANY DIR LEVELS
ARE IN THE CACHE?)  Each of these directories will have a certain number
of characters in its name, as defined by the CacheDirLength parameter in
the config file.  The directories will live under CacheRoot, also
defined in the config file.  For example, /index.html might be converted
to /proxy_cache/m/EYRopVKBHMrHd2VF6WXOQ (with CacheDirLevels and
CacheDirLength set to 1 and CacheRoot set to /proxy_cache).
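
One plausible answer to the parenthetical question above is that the two
settings control how many directories the cached files get spread across.
The sketch below assumes that every character of a directory name can take
64 values, which is what the 64-character encoding table in the filename
script at the end of this post suggests.  max_leaf_dirs is a made-up helper
name for illustration, not anything in mod_proxy.

```perl
use strict;
use warnings;

# Upper bound on the number of bottom-level cache directories, assuming each
# of the $length characters in a directory name can take 64 values.
sub max_leaf_dirs {
    my ($levels, $length) = @_;
    return 64 ** ($levels * $length);
}

for my $cfg ([1, 1], [2, 1], [3, 1]) {
    my ($levels, $length) = @$cfg;
    printf "CacheDirLevels %d, CacheDirLength %d: up to %d directories\n",
        $levels, $length, max_leaf_dirs($levels, $length);
}
```

More levels mean fewer files per directory at the cost of more directories
to create and walk.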

- For this example, we'll assume that at this point the cached file does
not exist.  The proxy server therefore forwards the request to
the mod_perl server and gets a response back.  The response will then be
cached UNLESS any of the following conditions is true
(ap_proxy_cache_update):
 - The HTTP status returned by the mod_perl server is not one of OK,
HTTP_MOVED_PERMANENTLY, or HTTP_NOT_MODIFIED
 - The response does not contain an Expires header
 - The response contains an Expires header that Apache can't parse
 - The HTTP status is OK but there's not a Last-Modified header
 - The mod_perl server sent only an HTTP header
 - The mod_perl server sent an Authorization field in the header
(Furthermore, if any of the above conditions are met, any existing
cached file will be deleted.)
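
The list above can be written down as a small predicate.  This is only a
sketch of the conditions as listed in this post, not a port of
ap_proxy_cache_update.  The names is_cacheable and parse_http_date and the
hash keys are invented for illustration, and the date parser accepts only
the RFC 1123 form, which is narrower than what Apache accepts.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

my %MONTH = (Jan => 0, Feb => 1, Mar => 2, Apr => 3, May => 4, Jun => 5,
             Jul => 6, Aug => 7, Sep => 8, Oct => 9, Nov => 10, Dec => 11);

# Parse only the RFC 1123 form, e.g. "Sun, 06 Nov 1994 08:49:37 GMT".
# Returns epoch seconds, or undef if the string is not in that form.
sub parse_http_date {
    my ($s) = @_;
    return undef unless defined $s;
    my ($d, $mon, $y, $h, $m, $sec) =
        $s =~ /^\w{3}, (\d{2}) (\w{3}) (\d{4}) (\d{2}):(\d{2}):(\d{2}) GMT$/
        or return undef;
    return undef unless exists $MONTH{$mon};
    # timegm dies on out-of-range values; treat that as unparseable
    return eval { timegm($sec, $m, $h, $d, $MONTH{$mon}, $y) };
}

# Returns 1 if none of the "do not cache" conditions listed above applies.
sub is_cacheable {
    my (%r) = @_;
    my %ok_status = map { $_ => 1 } (200, 301, 304);
    return 0 unless $ok_status{ $r{status} };                  # status not OK/301/304
    return 0 unless defined $r{expires};                       # no Expires header
    return 0 unless defined parse_http_date($r{expires});      # unparseable Expires
    return 0 if $r{status} == 200 && !defined $r{last_modified};
    return 0 if $r{header_only};                               # only a header was sent
    return 0 if defined $r{authorization};                     # Authorization present
    return 1;
}

my %resp = (
    status        => 200,
    expires       => 'Sun, 06 Nov 1994 08:49:37 GMT',
    last_modified => 'Sat, 05 Nov 1994 08:49:37 GMT',
);
print is_cacheable(%resp) ? "cacheable\n" : "not cacheable\n";
```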

- If the server decides to cache the file, it will store the file
exactly as it was received from the mod_perl server, with the addition
of a one-line header at the start of the file.  This header contains the
following information in the following format:
<current time> <last modified time> <expiration time> <"version">
<content length>

All times are stored as hex seconds since 1970 and are taken from the
HTTP header sent by the mod_perl server, with these fallbacks:
 - If the current time cannot be parsed from this header, the proxy
server determines the current time itself and uses that.
 - If the Last-Modified time cannot be parsed, it is set to the
Last-Modified time of the existing cached file, if one exists.
 - If the Last-Modified time is in the future, it is set to the current
time as determined previously.
 - If the Expires time cannot be parsed and a Last-Modified time exists
from the previous steps, the Expires time is set to
"now + min((date - lastmod) * factor, maxexpire)" (as noted in the
source code comments), where factor and maxexpire are the
CacheLastModifiedFactor and CacheMaxExpire parameters in the config
file.
 - If the Expires time cannot be parsed and there is no Last-Modified
time, the Expires time is set to "now + defaultexpire", where
defaultexpire is the CacheDefaultExpire parameter in the config file.
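
As a worked example of the two Expires fallbacks just described, here is a
minimal sketch.  fallback_expires and its argument names are invented for
illustration.  Everything is assumed to be in seconds, so any config value
expressed in another unit would need converting first.

```perl
use strict;
use warnings;
use List::Util qw(min);

# All arguments are in seconds (epoch seconds for now, date and lastmod).
# lastmod may be undef when no Last-Modified time is available.
sub fallback_expires {
    my (%a) = @_;
    # no Last-Modified time: now + defaultexpire
    return $a{now} + $a{defaultexpire} unless defined $a{lastmod};
    # otherwise: now + min((date - lastmod) * factor, maxexpire)
    return $a{now} + min(($a{date} - $a{lastmod}) * $a{factor}, $a{maxexpire});
}

my $now = time;
my $expires = fallback_expires(
    now           => $now,
    date          => $now,
    lastmod       => $now - 10 * 3600,   # document last changed ten hours ago
    factor        => 0.1,
    maxexpire     => 24 * 3600,
    defaultexpire => 3600,
);
printf "expires about %d seconds from now\n", $expires - $now;
```

The older a document is, the longer it is presumed to stay fresh, up to the
maxexpire cap.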

The "version" number stored in this file is an integer that is
incremented each time the file is overwritten by a fresh response from
the mod_perl server.
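
To inspect a cached file by hand, the first line can be split into the five
fields described above.  This is a sketch under the assumption that the
fields are separated by whitespace and that the three times are hex, as
stated above.  The version and content length are passed through as raw
strings.  parse_cache_header and the sample line are made up for
illustration.

```perl
use strict;
use warnings;

# Split the first line of a cache file into its five fields.  The three
# time fields are decoded from hex; version and content length are returned
# as the raw strings found in the line.
sub parse_cache_header {
    my ($line) = @_;
    my @f = split ' ', $line;
    return undef unless @f == 5;
    return undef if grep { !/\A[0-9A-Fa-f]+\z/ } @f[0 .. 2];
    return {
        date           => hex($f[0]),
        last_modified  => hex($f[1]),
        expires        => hex($f[2]),
        version        => $f[3],
        content_length => $f[4],
    };
}

# Made-up sample line, not copied from a real cache file.
my $h = parse_cache_header("3A137D25 3A12C0F0 3A14CEA5 00000001 00000400\n");
print "expires: ", scalar(gmtime $h->{expires}), "\n" if $h;
```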

The permissions on the cached files are quite strict: they can be read
and written only by the web server user.  Furthermore, the directories
created in the cache filesystem can only be viewed by the web server
user.

- If the status sent by the mod_perl server was a "304 Not Modified"
header and the "Last Modified" time, as determined in the steps above,
is before the "If-Modified-Since" time sent by the browser, then the
proxy server sends a "304 Not Modified" response to the browser. 
Otherwise, the full file, as returned by the mod_perl server, is sent to
the browser.

- Time passes.

- The browser makes another request for the file.  The URL is again
converted to a filename and this time the file is found in the cache. 
At this point, the following checks are performed (ap_proxy_cache_check)
and, if all are true, the server proceeds to the next step.  If any are
false, the server does not use the cached file:
 - The request is a GET request
 - There is no 'Pragma: No-Cache' in the HTTP header sent by the browser
 - There is no 'Authorization' field in the HTTP header sent by the
browser

(NOTE this should mean that all HEAD requests are passed through to the
mod_perl server.  However, in practice, this does not seem to be the
case.  Instead, HEAD requests are passed through unless there is an
unexpired file in the cache (retrieved via a previous GET request), in
which case that is used.  I may be misreading the code -- the check for
the GET request is on line 714 of proxy_cache.c, if you're interested.)
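
The three checks in the list above amount to a small predicate on the
incoming request.  The sketch below follows the list as written, so it
treats HEAD as "do not use the cache" even though, as noted, that may not
match what happens in practice.  may_use_cache is an invented name, and the
headers are assumed to arrive as a hash ref keyed by lower-cased name.

```perl
use strict;
use warnings;

# $headers is a hash ref of request headers keyed by lower-cased name.
sub may_use_cache {
    my ($method, $headers) = @_;
    my $pragma = defined $headers->{pragma} ? $headers->{pragma} : '';
    return 0 unless $method eq 'GET';                 # only GET requests
    return 0 if $pragma =~ /no-cache/i;               # Pragma: No-Cache
    return 0 if exists $headers->{authorization};     # Authorization present
    return 1;
}

print may_use_cache('GET', { pragma => 'No-Cache' })
    ? "look in the cache\n"
    : "go to the mod_perl server\n";
```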

- If the above conditions are true, the proxy server opens the cached
file, examines the first line of data, and follows this logic:

        If the "Expires" time listed in the first line of the cached
file has not been reached then it will use the cached file.  It must
then decide whether to send the file or just send a "304 Not Modified"
header.  If the "If-Modified-Since" time sent by the browser is greater
than or equal to the "Last-Modified" time in the cached file then the
proxy server sends a "304 Not Modified" response back to the browser,
telling it to use its locally cached copy of the file; otherwise, it
sends the cached file.

        If the "Expires" time *has* been reached, the proxy server then
re-requests the file from the mod_perl server, sends that back to the
client, and writes the new response to the cache file.
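
The decision just described has three outcomes, which can be sketched as
follows.  cache_action, its argument names and the returned strings are
invented for illustration.  All times are epoch seconds, and ims is undef
when the browser sent no If-Modified-Since header.

```perl
use strict;
use warnings;

# Decide what to do with an existing cache file.
#   refetch       the Expires time has been reached
#   not_modified  answer with "304 Not Modified"
#   send_cached   send the cached file
sub cache_action {
    my (%a) = @_;
    return 'refetch'      if $a{now} >= $a{expires};
    return 'not_modified' if defined $a{ims} && $a{ims} >= $a{lastmod};
    return 'send_cached';
}

my $now = time;
print cache_action(
    now     => $now,
    expires => $now + 3600,       # still fresh for an hour
    lastmod => $now - 86400,      # cached copy last modified a day ago
    ims     => $now - 3600,       # browser copy is from an hour ago
), "\n";
```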

Various Questions:

* Is / cached separately from /index.html?

Yes.  The cache filenames are based on the URL before any aliasing takes
place.

* How can I tell if mod_proxy is caching requests?

Open two terminal windows and tail the output of the access logs (i.e.,
'tail -f access_log') on both the proxy and the mod_perl server.  Then,
use your browser to make a request to the proxy server and watch both
logs.  If you see your request in the mod_perl server access log, the
file's not being cached; if you don't, it is.

* Can I store the cache on an NFS server used by two or more httpd
binaries serving the same document root?

Yes.  The servers will all use the same names for the cache files.

* How are HEAD requests handled?

HEAD requests are passed to the mod_perl server UNLESS the URL has been
cached previously with a GET request, in which case they are served from
the cache.

* Is there a quick hack I can use to include the Expires and
Last-Modified headers in my Apache::ASP scripts?

Yes.  Throw this into your global.asa file:

use HTTP::Date ();

sub Script_OnStart {
        $Response->{Expires} = 60*60*24; # expires in a day
        my $last_modified = (stat $0)[9];
        $main::Response->AddHeader('Last-Modified',
                HTTP::Date::time2str($last_modified));
}

This will add an Expires field (of one day in the future) and a
Last-Modified field (of the file modification time) to all your pages.

* How does the garbage collection system work?

I don't know; I didn't investigate that.  Sorry.  Presumably, it combs
the cache every CacheGcInterval hours, as defined in the config file,
and deletes files if the cache is greater than CacheSize, also defined
in the config file.  Exactly *which* files are deleted is still a
mystery.

* How do I clear the cache?

The proxy server will re-request a file when its expiration date, as
stored in the first line of the cached file, has been reached.  So you
could write a routine to change that expiration date.  Or you could just
delete the file.  But finding the file to delete is tricky.  Here's a
script that was ported from the Apache C code that should work (NOTE
this works only for case-sensitive filesystems; a slightly different
algorithm is used for case-insensitive filesystems -- see
proxy_util.c in the Apache sources):

#!/usr/bin/perl
# Convert $URL to a mod_proxy cache filename
# Ported blindly from src/modules/proxy/proxy_util.c in the Apache
# 1.3.12 distribution

use strict;
use Digest::MD5 qw(md5);

# this should be the URL that the proxy server is fetching from the
# mod_perl server
my $URL = 'http://www.myserver.com:80/myfile.html';

my $ndepth = 1; # set to CacheDirLevels in your proxy conf file
my $nlength = 1; # set to CacheDirLength in your proxy conf file

my @digest = split //, md5($URL);
my @enc_table = split //,
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_@";

my $x = ''; my @tmp = ();
my ($i, $k, $d);
for ($i = 0, $k = 0; $i < 15; $i += 3) {
    $x = (ord($digest[$i]) << 16) | (ord($digest[$i + 1]) << 8) |
ord($digest[$i + 2]);
    $tmp[$k++] = $enc_table[$x >> 18];
    $tmp[$k++] = $enc_table[($x >> 12) & 0x3f];
    $tmp[$k++] = $enc_table[($x >> 6) & 0x3f];
    $tmp[$k++] = $enc_table[$x & 0x3f];
}

# one byte left
$x = ord($digest[15]);
$tmp[$k++] = $enc_table[$x >> 2];   # use up 6 bits
$tmp[$k++] = $enc_table[($x << 4) & 0x3f];

# now split into directory levels

my @val = ();
for ($i = $k = $d = 0; $d < $ndepth; ++$d) {
#   memcpy(&val[i], &tmp[k], nlength);  -- copy exactly $nlength characters
    @val[$i..($i+$nlength-1)] = @tmp[$k..($k+$nlength-1)];
        
    $k += $nlength;
    $val[$i + $nlength] = '/';
    $i += $nlength + 1;
}

#memcpy(&val[i], &tmp[k], 22 - k);  -- @tmp holds 22 characters, indices 0..21
@val[$i..($i+21-$k)] = @tmp[$k..21];

print join ('', @val), "\n";
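
For the other option mentioned above, changing the expiration date instead
of deleting the file, here is a minimal and untested-against-a-real-cache
sketch that works on the header line only.  It assumes the five-field,
space-separated layout described earlier and overwrites the third field
with zeros of the same width so that the line keeps its length.
expire_header_line is an invented name, and writing the line back into the
cache file is deliberately left out.

```perl
use strict;
use warnings;

# Return a copy of a cache-file header line with the third field (the
# expiration time) replaced by zeros of the same width.  Returns undef if
# the line does not consist of five space-separated fields.
sub expire_header_line {
    my ($line) = @_;
    $line =~ /\A(\S+ +\S+ +)(\S+)( +\S+ +\S+\s*)\z/ or return undef;
    return $1 . ('0' x length($2)) . $3;
}

# Made-up sample line, not copied from a real cache file.
print expire_header_line("3A137D25 3A12C0F0 3A14CEA5 00000001 00000400\n");
```

An expiration time of zero lies in the past, so by the logic described
earlier the next request should cause the proxy to re-request the file.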

Re: mod_proxy caching documentation

Posted by barries <ba...@slaysys.com>.
On Thu, Nov 16, 2000 at 01:37:41AM -0500, Dan McCormick wrote:
> 
> I came up with some documentation below by sifting through the
> source code.

Excellent, thanks!

If a malformed Expires: prevents mod_proxy from caching a response (

> The response will then be
> cached UNLESS any of the following conditions are true
> (ap_proxy_cache_update):
[snip]
>  - The response contains an Expires header that Apache can't parse

), why do they go to some lengths to make up for a malformed one (

> if the Expires time cannot be parsed and
> a Last Modified time exists from the previous step, then the Expires
> time is set to "now + min((date - lastmod) * factor, maxexpire)" (as
> noted in the source code comments)

)?  I'm assuming that it can because that's a bit of extra logic that
wouldn't need to be there otherwise.  Or maybe it's leftover code that
never fires.

I thought (not that I remember why) that it didn't need an Expires: header
and that it would make up a value of its own based on the .conf settings.

- Barrie

Re: mod_proxy caching documentation

Posted by Perrin Harkins <pe...@primenet.com>.
On Thu, 16 Nov 2000, Joshua Chamas wrote:
> I think it would be interesting if you chronicled the capacity 
> improvements to your site using the mod_proxy server like this.  
> I don't know how well mod_proxy does this caching from a performance
> perspective, and it might be nice to see some numbers that
> one could later compare with some of the commercial caching
> products.

In my experience, mod_proxy has good performance for cached pages.  It's
not as good as Apache's static file performance (it has to hash the filename,
check the expiration, etc. on each hit), but it's more than good enough for a
medium-sized site to run on one, and a cluster of them can handle tons of
traffic.  The processes are very small, so you can actually load up a
Linux/Intel box with RAM and set MaxClients at a few hundred.

- Perrin


Re: mod_proxy caching documentation

Posted by Joshua Chamas <jo...@chamas.com>.
Dan McCormick wrote:
> 
> Hi,
> 
> After struggling with trying to figure out mod_proxy's caching algorithm
> and noting from the list archive's that others had, too -- and due to
> the dearth of existing documentation on the subject -- I came up with
> some documentation below by sifting through the source code.  Most of it
> isn't explicitly mod_perl-related, but I hope those trying to set it up

Thanks for the read.  Very enlightening.  I'm guessing
the dir levels matter because they let the files be
spread over that many more directories, so there isn't a
large directory hashing penalty on a HUGE number of files.
5 is probably a bit much though if it really creates 4-5
directories for each file it stores, and if you are using
this only for a proxy in reverse mode for mod_perl, it's likely
you could get away with 2-3 levels.

I think it would be interesting if you chronicled the capacity 
improvements to your site using the mod_proxy server like this.  
I don't know how well mod_proxy does this caching from a performance
perspective, and it might be nice to see some numbers that
one could later compare with some of the commercial caching
products.

--Joshua
