Posted to modproxy-dev@apache.org by Graham Leggett <mi...@sharp.fm> on 2001/02/23 03:23:56 UTC
NewCache - a requirements spec v0.01

Hi all,

This is a preliminary discussion about a proposed caching module in
Apache v2.0. It's a sort of a requirements specification, if you will.

The design is based entirely on proxy caching as described in RFC2616,
and is rather tricky. As a result I've tried to describe things very
simply at the beginning, and then layer on each new piece of
complexity so that the big picture is not overwhelming.

===

mod_cache
=========

Requirements
------------

The purpose of any cache is to make the transfer of information through
or from a system more efficient. A cache is a tradeoff between a number
of attributes. In our case the tradeoffs are:

- Bandwidth conservation - We want to transfer as few bytes over the
network as possible.

- CPU cycle conservation - We want our webservers to do as little
crunching as is possible. Less crunching means less computing
horsepower is needed, and thus a smaller and faster server.

- Memory - We cache data in memory - memory is traded off for
performance above.

- Disk - We cache data to disk - disk space is traded off for
performance.

- Caching everything - We cache all data, from static data on disk, to
dynamically generated data, to data pulled from another server through a
reverse proxy.

- We use the control techniques described in RFC2616 as a "public
cache".

- THE DESIGN MUST BE EASY TO FOLLOW AND UNDERSTAND.


Caching - The Simple View
-------------------------

There are two tasks a cache module must perform at a basic level:

- Place new cached data into the cache
- Serve cached data from the cache

These two functions are handled by two separate halves of the cache: a
content generator "Cache Out", and a filter "Cache In":


                +-------------------------+
                |         Browser         |
                +-------------------------+
                    |              ^  ^
                    |              |  |
                    v              |  |
             +-----------+   Y     |  |
             | Cache Out |---------+  |
             +-----------+            |
                    |                 |
                    | N          +----------+
                    |            | Cache In |
                    |            +----------+
                    v                 ^
             +-------------+          |
             |    Apache   |----------+
             +-------------+ 

Very simplistically described, a request from a web browser is first
intercepted by the "Cache Out" content generator. If the request is
cached, the cached data is returned and the request ends immediately. If
not, the content generator does nothing and the rest of Apache is
responsible for generating the content.

At the other end, the "Cache In" filter is responsible for putting
content generated by Apache into the cache. This module directs data
either to memory or to disk (or a combination of both) depending on the
configuration of the cache.
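
As a rough illustration of the two halves, here is a minimal sketch in
Python rather than Apache's C. The names SimpleCache, cache_out,
cache_in and handle_request are made up for this example and are not
part of any Apache API; the store is a plain in-memory dict keyed on
the URL only.

```python
# Hypothetical sketch of the "simple view". Nothing here is Apache code.


class SimpleCache:
    def __init__(self):
        self._store = {}  # url -> cached body (memory only in this sketch)

    def cache_out(self, url):
        """Content generator half: return the cached body, or None to decline."""
        return self._store.get(url)

    def cache_in(self, url, body):
        """Filter half: remember the body, then pass it through unchanged."""
        self._store[url] = body
        return body


def handle_request(cache, url, generate):
    """generate(url) stands in for 'the rest of Apache'."""
    cached = cache.cache_out(url)
    if cached is not None:
        return cached  # cache hit: the request ends here
    # Cache miss: let the backend generate content, filtered by "Cache In".
    return cache.cache_in(url, generate(url))
```

The second request for the same URL never reaches generate(), which is
the whole point of the "Y" arrow in the diagram above.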


Caching - The Slightly More Complicated View
--------------------------------------------

Of course, caching isn't actually this easy. Some complications set in
when we note that data is not simply "inside" or "not inside" the
cache, but is also of varying freshness.

RFC2616 describes mechanisms for specifying how long an item in the
cache can remain fresh. When a cached entity expires and is no longer
fresh, we do not simply discard the cached data - instead the "Cache
Out" content generator turns the browser request into a conditional
request and hands it down to the rest of Apache.

The "Cache In" filter looks at the result of this conditional request.
If the result is "304 Not Modified", then the "Cache In" filter fulfils
the request from the cache just as the "Cache Out" content generator
would have at the start. 

If the result is not "304 Not Modified" it means there will be new data
on the way. The "Cache In" filter places the data in the cache as normal
replacing whatever was there before, and the data is passed to the
browser as normal.


              +----------------------------------------+
              |         Browser                        |
              +----------------------------------------+
                  |                ^             ^  ^
                  |                |             |  |
                  v                | Y           |  |
           +-----------+  Y      +-----------+   |  |
           | Cache Out |-------->| Cache Out |   |  +-----+
           | in cache? |         | fresh?    |   |        |
           +-----------+         +-----------+   |    +----------+
                  | N                      | N   |    | Cache In |
                  | +-------------------+  |     |    | serve    |
                  +-| Cache Out         |<-+     |    | from     |
                  | | force conditional |        |    | cache    |
                  | +-------------------+        |    +----------+ 
                  |                              |        |
                  v                            N |      Y |
           +-------------+              +---------------------+
           |    Apache   |--------------| Cache In            |
           +-------------+              | force conditional & |
                                        | 304 Not Modified?   | 
                                        +---------------------+
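
The flow in the diagram can be sketched roughly as follows. This is an
illustration under stated assumptions, not a proposed implementation:
the function serve, the dict-based cache entry and the fixed ttl are
all hypothetical, and the ETag / If-None-Match pair is used as the
validator purely for brevity (RFC2616 equally allows Last-Modified /
If-Modified-Since). A real cache would derive the freshness lifetime
from the response headers rather than from a constant.

```python
import time


def serve(cache, url, request_headers, origin, ttl=60, now=None):
    """origin(url, headers) -> (status, response_headers, body).

    'cache' is a plain dict; 'ttl' is an assumed fixed freshness lifetime.
    """
    now = time.time() if now is None else now
    entry = cache.get(url)

    # "Cache Out": in cache and still fresh, so answer immediately.
    if entry is not None and now < entry["expires_at"]:
        return entry["body"]

    # "Cache Out": stale entry with a validator, so force a conditional request.
    headers = dict(request_headers)
    forced = entry is not None and entry["etag"] is not None
    if forced:
        headers["If-None-Match"] = entry["etag"]

    status, response_headers, body = origin(url, headers)

    # "Cache In": 304 means the stale copy is still good; refresh and serve it.
    if forced and status == 304:
        entry["expires_at"] = now + ttl
        return entry["body"]

    # "Cache In": new data arrived; replace whatever was cached before.
    if status == 200:
        cache[url] = {
            "body": body,
            "etag": response_headers.get("ETag"),
            "expires_at": now + ttl,
        }
    return body
```

Note that the stale body is kept until the backend has answered, so a
"304 Not Modified" costs a round trip but no body transfer.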

In addition to the above, RFC2616 also defines ways to determine whether
an object is cacheable or not. Depending on the value of the
Cache-Control (and possibly other) headers, the "Cache In" and "Cache
Out" modules decide whether an object is cacheable at all. If not, these
modules tell the "Storage Manager" (coming soon) to delete the object
from the cache if necessary.
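
A simplified sketch of such a cacheability check is shown below. The
function names are hypothetical, and only two Cache-Control directives
are looked at: "no-store" (on the request or the response) and
"private" (on the response), both of which RFC2616 says a shared cache
must honour. Everything else RFC2616 takes into account (request
method, status code, Authorization, Expires and so on) is left out,
and headers are assumed to arrive in a plain dict with canonical
capitalisation.

```python
def parse_cache_control(value):
    """Split a Cache-Control header value into {directive: argument}.

    Simplification: quoted arguments that themselves contain commas are
    not handled properly.
    """
    directives = {}
    for part in (value or "").split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"')
    return directives


def is_cacheable(request_headers, response_headers):
    """Decide whether a public (shared) cache may store this response."""
    req = parse_cache_control(request_headers.get("Cache-Control"))
    resp = parse_cache_control(response_headers.get("Cache-Control"))
    if "no-store" in req or "no-store" in resp:
        return False  # must not be stored anywhere
    if "private" in resp:
        return False  # meant for a single user, not for a shared cache
    return True
```

When is_cacheable() returns False for a URL that already has an object
in the cache, that is the moment the cache would ask for the object to
be deleted.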
  

Caching - The Plot Thickens
---------------------------

Yes, it gets even more complicated, but not really.

HTTP/1.1 (RFC2616) supports content negotiation. In a nutshell this
means that a single URL can have a number of representations: The
language might be different, or the data might have a special content
encoding, or it might be compressed. This means that different browsers
can get different data in response to the same request for the same URL.
The cache needs to handle this in an intelligent fashion.

To do this, we break down the cache code again and introduce a new bit:

- "Cache Out" - The content generator
- "Cache In" - the filter
- "Storage Manager" - the bit that handles the actual storing of the
data, either on disk or in RAM.

To keep the cache code simple we say that the "Cache Out" and "Cache In"
modules have no knowledge whatsoever of content negotiation. All they do
is give the URL and the request headers to the "Storage Manager", and
using the combination of URL and request headers the "Storage Manager"
makes the decision as to whether an object is cached or not, or whether
an object should be replaced. 

So, we could see four (or more) different objects in the cache for the
same URL, each with its own independently defined freshness, and each
treated entirely separately from the others:


                                                 +------------+
                                           +-----| Normal     |
                          +---------+      |     +------------+
                  +------>| English |------+
                  |       +---------+      |     +------------+
                  |                        +-----| Compressed |
   +-------+      |                              +------------+
   |  URL  |------+
   +-------+      |                              +------------+
                  |                        +-----| Normal     |
                  |       +---------+      |     +------------+
                  +------>| French  |------+
                          +---------+      |     +------------+
                                           +-----| Compressed |
                                                 +------------+
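
One way the "Storage Manager" could tell these objects apart is
sketched below. This is an assumption on my part, not something the
text above specifies: the sketch uses the Vary response header from
RFC2616 to learn which request headers select a representation, and
builds the lookup key from the URL plus the values of those headers.
The class and method names are hypothetical, and header values are
compared as exact strings, which is cruder than real content
negotiation.

```python
class StorageManager:
    """Hypothetical in-memory store keyed on URL + selecting request headers."""

    def __init__(self):
        self._vary = {}     # url -> tuple of lower-cased request header names
        self._objects = {}  # (url, tuple of header values) -> body

    def _key(self, url, request_headers):
        names = self._vary.get(url, ())
        lowered = {k.lower(): v for k, v in request_headers.items()}
        return (url, tuple(lowered.get(name, "") for name in names))

    def store(self, url, request_headers, vary, body):
        # 'vary' is the value of the response's Vary header, e.g.
        # "Accept-Language, Accept-Encoding".
        names = sorted(n.strip().lower() for n in vary.split(",") if n.strip())
        self._vary[url] = tuple(names)
        self._objects[self._key(url, request_headers)] = body

    def fetch(self, url, request_headers):
        return self._objects.get(self._key(url, request_headers))
```

"Cache In" and "Cache Out" only ever pass the URL and headers through;
the fact that four objects sit behind one URL stays hidden inside the
key.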


The "Storage Manager" is a modular design - add-on modules allow you to
cache to shared memory, or disk, or to other cache storage mechanisms
still to be invented.
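
To make the modular idea concrete, here is a small sketch of what a
pluggable backend interface might look like. The interface
(StorageBackend with put/get/delete) and both backends are invented
for this example; hashing the key into a file name with SHA-256 is
just one convenient way to get a safe path and is not implied by
anything above.

```python
import hashlib
import os
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Hypothetical interface every storage add-on would implement."""

    @abstractmethod
    def put(self, key, data): ...

    @abstractmethod
    def get(self, key): ...

    @abstractmethod
    def delete(self, key): ...


class MemoryBackend(StorageBackend):
    def __init__(self):
        self._data = {}

    def put(self, key, data):
        self._data[key] = data

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


class DiskBackend(StorageBackend):
    def __init__(self, directory):
        self._dir = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        # Hash the key so that any URL maps to a safe file name.
        name = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return os.path.join(self._dir, name)

    def put(self, key, data):
        with open(self._path(key), "wb") as f:
            f.write(data)

    def get(self, key):
        try:
            with open(self._path(key), "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def delete(self, key):
        try:
            os.remove(self._path(key))
        except FileNotFoundError:
            pass
```

Anything that honours the same three calls could be dropped in without
the rest of the cache noticing.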


Caching - The Complicated Bit
-----------------------------

Just when you thought that was it!

It has been pointed out that storing both compressed and uncompressed
versions of the same object representation in the cache is a waste of
resources. Although the cache tries very hard to remain transparent to
the content that is being cached, there are some optimisations that can
be made to speed up the process. The best place for this to happen is in
an "Optimisation Layer" sandwiched between the "Cache In" and "Cache
Out" modules, and the "Storage Manager".


   +-----------+
   | Cache Out |-----+
   +-----------+     |    +--------------------+    +-----------------+
                     +--->| Optimisation Layer |--->| Storage Manager |
   +-----------+     |    +--------------------+    +-----------------+
   | Cache In  |-----+
   +-----------+

The optimisation layer is designed to perform some optimisations on the
data going into and out of the cache.

Some optimisations include:

- Compression:

If uncompressed data is being put into the "Storage Manager", the
"Optimisation Layer" compresses the data before putting it in the cache.

If uncompressed data is requested from the "Storage Manager", the
"Optimisation Layer" will uncompress the data on the fly before passing
it back to either the "Cache In" or "Cache Out" modules.

In both of these cases, neither the "Cache In", "Cache Out" nor "Storage
Manager" modules need worry about these optimisations.

These optimisations also need not depend at all on other modules in
Apache, such as mod_gzip.
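
The compression optimisation might look something like the sketch
below. The class name and its put/get signatures are hypothetical, the
storage is any dict-like object, and Python's zlib module stands in
for whatever compression the real layer would use. The point is only
that a single compressed copy is kept, whichever form the caller asks
for.

```python
import zlib


class OptimisationLayer:
    """Hypothetical wrapper that keeps exactly one compressed copy per key."""

    def __init__(self, storage):
        self._storage = storage  # any dict-like "Storage Manager"

    def put(self, key, data, compressed=False):
        # Uncompressed data is compressed on the way in.
        if not compressed:
            data = zlib.compress(data)
        self._storage[key] = data

    def get(self, key, want_compressed=False):
        data = self._storage.get(key)
        if data is None:
            return None
        # Uncompressed requests are served by inflating on the fly.
        return data if want_compressed else zlib.decompress(data)
```

Neither caller needs to know which form is actually stored, and the
storage underneath never sees an uncompressed byte.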



====

Regards,
Graham
-- 
-----------------------------------------
minfrin@sharp.fm		"There's a moon
					over Bourbon Street
						tonight..."