Posted to dev@httpd.apache.org by Brian Akins <br...@turner.com> on 2006/05/01 14:46:14 UTC

Re: Possible new cache architecture

Graham Leggett wrote:
>
> The potential danger with this is for race conditions to happen while 
> expiring cache entries. If the data entity expired before the header 
> entity, it potentially could confuse the cache - is the entry cached or 
> not? The headers say yes, data says no.

Nope.  Look at the way the current http cache works. An http "object," 
headers and data, is only valid if both headers and data are valid.

> Each variant should be an independent cached entry, the cache should 
> allow different variants to be cached side by side.

Yes.  Each is distinguished by its key.

>> As far as mod_cache is concerned these are 3 independent entries, but 
>> mod_http_cache knows how to "stitch" them together.
>>
>> mod_cache should *not* be HTTP specific in any way.
> 
> mod_cache need not be HTTP specific, it only needs the ability to cache 
> multiple entities (data, headers) under the same key, 

No.

> In other words, there must be the ability to cache by a key and a subkey.

No. mod_http_cache generates new keys for headers (key.header), data 
(key.data), and each variant (key1.header, key2.header, key1.data... 
etc.).  As far as the underlying generic cache is concerned, they are 
all independent entries.
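
A minimal sketch of the key scheme described above, in Python for brevity. The class and function names are hypothetical and are not part of any existing httpd module; the only assumption taken from the thread is that the HTTP layer derives ".header" and ".data" keys and treats an object as valid only when both pieces are present.

```python
class GenericCache:
    """A flat key/value store that knows nothing about HTTP (hypothetical)."""

    def __init__(self):
        self._entries = {}

    def put(self, key, value):
        self._entries[key] = value

    def get(self, key):
        return self._entries.get(key)

    def delete(self, key):
        self._entries.pop(key, None)


def header_key(key):
    return key + ".header"


def data_key(key):
    return key + ".data"


def store_object(cache, key, headers, body):
    # Two independent entries as far as the generic cache is concerned.
    cache.put(data_key(key), body)
    cache.put(header_key(key), headers)


def fetch_object(cache, key):
    # The HTTP layer stitches the pieces together: the object is usable
    # only if both the header entry and the data entry are present.
    headers = cache.get(header_key(key))
    body = cache.get(data_key(key))
    if headers is None or body is None:
        return None
    return headers, body
```

If the data entry is removed on its own, fetch_object() reports a miss rather than serving headers without a body.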

-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies

Re: Possible new cache architecture

Posted by Brian Akins <br...@turner.com>.

William A. Rowe, Jr. wrote:

> And, of course, inserting the hit once it's composed is important, and can
> happen in parallel (3 clients looking for the same, and then fetching the
> same page from the origin).  But it's harmless if the insertion is mutex
> protected, and the insertion can only happen once the page is fetched
> complete.

In the case of mod_disk_cache, the way I would do it is to have a 
deterministic tempfile rather than use apr_tempfile, and open it EXCL.
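
A rough illustration of the idea, written in Python rather than against APR. The file naming (SHA-1 of the key, ".tmp" and ".data" suffixes) and the function names are assumptions made for the example; the point is only that every writer derives the same temp name and that exclusive creation lets exactly one of them proceed.

```python
import hashlib
import os


def temp_path_for(cache_dir, key):
    # Deterministic: every process derives the same temp name for a key.
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest + ".tmp")


def try_store(cache_dir, key, body):
    """Return True if this caller stored the entry, False if another
    writer already holds the temp file for the same key."""
    tmp = temp_path_for(cache_dir, key)
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    except FileExistsError:
        return False
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(body)
        # Publish the finished file under its final name in one step.
        os.replace(tmp, tmp[:-len(".tmp")] + ".data")
    except BaseException:
        os.unlink(tmp)
        raise
    return True
```

A caller that gets False can simply serve the response it fetched without caching it, since some other request is already writing the same entry.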

-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies

Re: Possible new cache architecture

Posted by "William A. Rowe, Jr." <wr...@rowe-clan.net>.

Brian Akins wrote:
> Graham Leggett wrote:
> 
>> That's two hits to find whether something is cached.
> 
> You must have two hits if you support vary.

Well, one to three hits.  One, if you use an arbitrary page (MRU or most
frequently referenced would be optimal, but it really doesn't matter)
and then determine what varies, and if you are in the right place, or what
that right place is (page by language, or whatever fields it varied by.)

Three hits or more if your variant also varies ;)

>> How are races prevented?
> 
> shouldn't be any.  something is in the cache or not.  if one "piece" of 
> an http "object" is not valid or in cache, the object is invalid. 
> Although other variants may be valid/in cache.

And, of course, inserting the hit once it's composed is important, and can
happen in parallel (3 clients looking for the same, and then fetching the
same page from the origin).  But it's harmless if the insertion is mutex
protected, and the insertion can only happen once the page is fetched
complete.

Re: Possible new cache architecture

Posted by "William A. Rowe, Jr." <wr...@rowe-clan.net>.

Graham Leggett wrote:
> Brian Akins wrote:
> 
>>> That's two hits to find whether something is cached.
>>
>>
>> You must have two hits if you support vary.
> 
> 
> You need only one - bring up the original cached entry with the key, and 
> then use cheap subkeys over a very limited data set to find both the 
> variants and the header/data.
> 
>>> How are races prevented?
>>
>>
>> shouldn't be any.  something is in the cache or not.  if one "piece" 
>> of an http "object" is not valid or in cache, the object is invalid. 
>> Although other variants may be valid/in cache.
> 
> 
> I can think of one race off the top of my head:
> 
> - the browser says "send me this URL".
> 
> - the cache has it cached, but it's stale, so it asks the backend 
> "If-None-Match".
> 
> - the cache reaper comes along, says "oh, this is stale", and reaps the 
> cached body (which is independent, remember?). The data is no longer 
> cached even though the headers still exist.
> 
> - The backend says "304 Not Modified".
> 
> - the cache says "cool, will send my copy upstream. Oops, where has my 
> data gone?".

I think that can be avoided by, instead of reaping the cached body, actually
setting aside the cached body (public > private), by changing its key or
whatnot.  Then - throw it away after the backend says "200 OK", and replace
it with something new.  Or, rekey it a second time (private > public) when
the backend reports "304 NOT MODIFIED".

In the race, one will set it aside looking for another, the second will make
a fresh request (it doesn't see it in the cache), and either the first or
second request will wrap up -last- to place the final copy back into the
cache, replacing the document from the winner.  No harm no foul.
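
The set-aside idea can be sketched roughly as follows, in Python. Everything here is hypothetical: the SetAsideCache class, the ".private." key format, and the ask_backend callback are invented for the example, and the sketch is single-threaded, so it assumes the rename would be made atomic by whatever store sits underneath.

```python
class SetAsideCache:
    """Hypothetical in-memory store with a rename (rekey) operation."""

    def __init__(self):
        self.entries = {}

    def rename(self, old_key, new_key):
        # Move an entry to a new key; report False if it is not there.
        if old_key not in self.entries:
            return False
        self.entries[new_key] = self.entries.pop(old_key)
        return True


def revalidate(cache, key, request_id, ask_backend):
    private_key = "%s.private.%s" % (key, request_id)
    # public > private: hide the stale entry from the reaper and from
    # other requests while the backend is being asked.
    if not cache.rename(key, private_key):
        return None  # not visible in the cache; caller fetches afresh
    status, new_body = ask_backend()
    if status == 304:
        # private > public: the set-aside copy is good again.
        cache.rename(private_key, key)
    else:
        # 200 OK: throw the old copy away and store the new one.
        del cache.entries[private_key]
        cache.entries[key] = new_body
    return cache.entries[key]
```

A second request arriving during revalidation gets None, goes to the origin itself, and whichever request finishes last leaves its copy under the public key.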

Bill

Re: Possible new cache architecture

Posted by Davi Arnaut <da...@haxent.com.br>.

On Mon, 01 May 2006 22:46:44 +0200
Graham Leggett <mi...@sharp.fm> wrote:

> Brian Akins wrote:
> 
> >> That's two hits to find whether something is cached.
> > 
> > You must have two hits if you support vary.
> 
> You need only one - bring up the original cached entry with the key, and 
> then use cheap subkeys over a very limited data set to find both the 
> variants and the header/data.
> 
> >> How are races prevented?
> > 
> > shouldn't be any.  something is in the cache or not.  if one "piece" of 
> > an http "object" is not valid or in cache, the object is invalid. 
> > Although other variants may be valid/in cache.
> 
> I can think of one race off the top of my head:
> 
> - the browser says "send me this URL".
> 
> - the cache has it cached, but it's stale, so it asks the backend 
> "If-None-Match".
> 
> - the cache reaper comes along, says "oh, this is stale", and reaps the 
> cached body (which is independent, remember?). The data is no longer 
> cached even though the headers still exist.
> 
> - The backend says "304 Not Modified".
> 
> - the cache says "cool, will send my copy upstream. Oops, where has my 
> data gone?".

Sorry, but this only happens in your imagination. It's pretty obvious
that mod_http_cache will handle this.

> The end user will probably experience this as "oh, the website had a 
> glitch, let me try again", so it won't be reported as a bug.

No.

> Ok, so you tried to lock the body before going to the backend, but 
> searching for and locking the body would have been an additional wasted 
> cache hit if the backend answered with its own body. Not to mention 
> having to write and debug code to do this.

Locks are not necessary; perhaps you are imagining something very different.
If a data body disappears under mod_http_cache it is not a big deal! It will
refuse to serve the request from the cache and a new version of the page will
be cached.

> Races need to be properly handled, and atomic cache operations will go a 
> long way to prevent them.

I think we are discussing apples and oranges. First, we only want to *organize*
the current cache code into a more layered solution. The current semantics won't
change, yet!

--
Davi Arnaut

Re: Possible new cache architecture

Posted by Gonzalo Arana <go...@gmail.com>.

On 5/2/06, Brian Akins <br...@turner.com> wrote:
> Gonzalo Arana wrote:
>
> > What problems have you seen with this approach?  postfix uses this
> > architecture, for instance.
>
> Postfix implements SMTP, which is an asynchronous protocol.

And what problems might this approach bring?

> > Excuse my ignorance, what does "event mpm ... keep the balance very
> > good" mean?
>
> Not all your threads are tied up doing keepalives, for example.

Ah, I see (I was unfamiliar with the event MPM, sorry).

--
Gonzalo A. Arana

Re: Possible new cache architecture

Posted by Brian Akins <br...@turner.com>.

Gonzalo Arana wrote:

> What problems have you seen with this approach?  postfix uses this
> architecture, for instance.

Postfix implements SMTP, which is an asynchronous protocol.

> Excuse my ignorance, what does "event mpm ... keep the balance very 
> good" mean?

Not all your threads are tied up doing keepalives, for example.

-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies

Re: Possible new cache architecture

Posted by Gonzalo Arana <go...@gmail.com>.

On 5/2/06, Brian Akins <br...@turner.com> wrote:
> Gonzalo Arana wrote:
> > A more suitable design for this task I think would be to make each
> > process to have a special purpose: cache maintenance (purging expired
> > entries, purging entries to make room for new ones, creating new
> > entries, and so on), request processing (network/disk I/O, content
> > filtering, and so on), or what ever.
>
> In my experience, this always sounds good in theory, but just doesn't
> ever work in the real world.  The event mpm is "sorta" a step in that
> direction, but seems to keep the balance pretty good.

What problems have you seen with this approach?  postfix uses this
architecture, for instance.

Excuse my ignorance, what does "event mpm ... keep the balance very good" mean?

--
Gonzalo A. Arana

Re: Possible new cache architecture

Posted by Brian Akins <br...@turner.com>.

Gonzalo Arana wrote:
> A more suitable design for this task I think would be to make each
> process to have a special purpose: cache maintenance (purging expired
> entries, purging entries to make room for new ones, creating new
> entries, and so on), request processing (network/disk I/O, content
> filtering, and so on), or what ever.

In my experience, this always sounds good in theory, but just doesn't 
ever work in the real world.  The event mpm is "sorta" a step in that 
direction, but seems to keep the balance pretty good.

-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies

Re: Possible new cache architecture

Posted by Gonzalo Arana <go...@gmail.com>.

Seems to me that the thundering herd / performance degradation is
inherent to apache design: all threads/processes are exact clones.

A more suitable design for this task, I think, would be to give each
process a special purpose: cache maintenance (purging expired
entries, purging entries to make room for new ones, creating new
entries, and so on), request processing (network/disk I/O, content
filtering, and so on), or whatever.

This way, performance degradation caused by cache mutex can be
minimized.  Request processors would only get queued/locked when
querying the cache, which can be made a single operation if cache is
smart enough to figure out the right response from original request,
right?

Regards,

--
Gonzalo A. Arana

Re: Possible new cache architecture

Posted by Graham Leggett <mi...@sharp.fm>.

On Tue, May 2, 2006 5:50 pm, Brian Akins said:

> This seems more like a wish list.  I just want to separate out the cache
> and protocol stuff.

HTTP compliance isn't a wish, it's a requirement. A patch that breaks
compliance will end up being -1'ed.

The thundering herd issues are also a requirement, as provision was made
for it in the v2.0 design. The cache must deliver what the HTTP cache
requires (which in turn delivers what users require), not the other way
around.

Separating the cache and the protocol has advantages, but it also has the
disadvantage that fixing bugs like thundering herd may require interface
changes, forcing people to have to wait for major version number changes
before they see their problems fixed.

In this scenario, the separation of cache and protocol is (very) nice to
have, but not so nice that end users are disadvantaged.

>> - The ability to amend a subkey (the headers) on an entry that is
>> already
>> cached.
>
> mod_http_cache should handle.  to new mod_cache, it's just another
> key/value.

How does mod_http_cache do this without the need for locking (and thus
performance degradation)?

How does mod_cache guarantee that it won't expire the body without
atomically expiring the headers with it?

>> - The ability to invalidate a particular cached variant (ie headers +
>> data) in one atomic step, without affecting threads that hold that
>> cached
>> entry open at the time.
>
> mod_http_cache should handle.

Entry invalidation is definitely mod_cache's problem, it falls under cache
size maintenance and expiry.

Remember that mod_http_cache only runs when requests are present, entry
invalidation has to happen whether there are requests present or not, via
a separate thread, separate process, cron job, whatever.

>> - The ability to read from a cached object that is still being written
>> to.
>
> Nice to have.  out of scope for what I am proposing.  new mod_cache
> should be the place to implement this if underlying provider supports it.

It's not nice to have, no. It's a real problem that has inspired people to
log bugs, and very recently, for one person to submit a patch.

Regards,
Graham
--

Re: Possible new cache architecture

Posted by Brian Akins <br...@turner.com>.

Graham Leggett wrote:

> To be HTTP compliant, and to solve thundering herd, we need the following
> from a cache:

This seems more like a wish list.  I just want to separate out the cache 
and protocol stuff.

> - The ability to amend a subkey (the headers) on an entry that is already
> cached.

mod_http_cache should handle.  to new mod_cache, it's just another 
key/value.

> - The ability to invalidate a particular cached variant (ie headers +
> data) in one atomic step, without affecting threads that hold that cached
> entry open at the time.

mod_http_cache should handle. Keep a list of variants cached - this 
should use a provider interface as well.  mod_cache would handle 
whatever locking, ref counting, etc, needs to be done, if any.

> - The ability to read from a cached object that is still being written to.

Nice to have.  out of scope for what I am proposing.  new mod_cache 
should be the place to implement this if underlying provider supports it.

> - A guarantee that the result of a broken write (segfault, timeout,
> connection reset by peer, whatever) will not result in a broken cached
> entry (ie that the cached entry will eventually be invalidated, and all
> threads trying to read from it will eventually get an error).

agreed.  new mod_cache should handle this.

> Certainly separate the protocol from the physical cache, just make sure
> the physical cache delivers the shopping list above :)

Most seem like protocol specific stuff.

-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies

Re: Possible new cache architecture

Posted by Graham Leggett <mi...@sharp.fm>.

On Tue, May 2, 2006 5:27 pm, Brian Akins said:

> Still not sure how this is different from what we are proposing.  we
> really want to separate protocol from cache stuff.  If we have a
> "revalidate" for the generic cache it should address all your concerns.
> ???

To be HTTP compliant, and to solve thundering herd, we need the following
from a cache:

- The ability to amend a subkey (the headers) on an entry that is already
cached.

- The ability to invalidate a particular cached variant (ie headers +
data) in one atomic step, without affecting threads that hold that cached
entry open at the time.

- The ability to read from a cached object that is still being written to.

- A guarantee that the result of a broken write (segfault, timeout,
connection reset by peer, whatever) will not result in a broken cached
entry (ie that the cached entry will eventually be invalidated, and all
threads trying to read from it will eventually get an error).

Certainly separate the protocol from the physical cache, just make sure
the physical cache delivers the shopping list above :)

Regards,
Graham
--

Re: Possible new cache architecture

Posted by Brian Akins <br...@turner.com>.

Graham Leggett wrote:
> 
> The way HTTP caching works is a lot more complex than in your example, you
> haven't taken into account conditional HTTP requests.
> ...

Still not sure how this is different from what we are proposing.  we 
really want to separate protocol from cache stuff.  If we have a 
"revalidate" for the generic cache it should address all your concerns.  ???

-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies

Re: Possible new cache architecture

Posted by Gonzalo Arana <go...@gmail.com>.

On 5/3/06, Graham Leggett <mi...@sharp.fm> wrote:
> Gonzalo Arana wrote:
>
> > Again, I am in the dark: why would cached request headers need to be
> > replaced or edited in the same entity?
>
> It's a requirement of the HTTP/1.1 spec.
>
> <snip>
> non-modified response headers to conditional requests need to update
> cached response headers.
> </snip>

> <snip>
> we should try to avoid 'dialog' with cache backend.
> </snip>
>
> The catch is when the server sent "304 Not Modified" - you need to
> update your cache to say "yep, my cached entry is still fresh", ie
> update the headers, without touching the body, which hasn't changed.

I see the light now :).

Having a single cache_admin proc/thread would make this easier, since
any operation can be presented as atomic, while it may require more
than a single syscall (I know, the goal is to avoid full entity
duplication).  Anyway, I guess a good policy is to have 'editable'
content as binary data (i.e., no variable length).  Perhaps this is
not possible anyway :(.

Of course, to avoid a 'dialog' between httpd process and cache_admin,
both cache_admin and httpd must be smart enough.

> > That's why I suggested a dedicated process/thread for cache
> > administration, which is not a good idea if too many lookups are
> > issued to this process on each request received.
>
> <snip>
>
> I think in the long run, a dedicated process is the way to go.

+1 :).

Regards,

--
Gonzalo A. Arana

Re: Possible new cache architecture

Posted by Ruediger Pluem <rp...@apache.org>.

On 05/03/2006 10:46 PM, Graham Leggett wrote:
> 
> mod_cache definitely needs cache admin, currently it's implemented as an
> external program that is called via cron, which doesn't help if you're
> on a box without cron. Cache cleaning can be done either when a

Not completely true. According to the documentation you can start it as a daemon
(-d, http://httpd.apache.org/docs/2.2/programs/htcacheclean.html#options) that
runs periodically. Of course this daemon has to be started and configured separately
from httpd, so it may not be the final solution.

Regards

Rüdiger

Re: Possible new cache architecture

Posted by Brian Akins <br...@turner.com>.

Graham Leggett wrote:

> I think in the long run, a dedicated process is the way to go.

I think using a provider architecture would be best and keep complexity 
out of mod_cache.  Some module(s) would implement the necessary cache 
management functions and mod_cache would push/pull/probe the "manager" 
using this interface.  The manager may or may not be tied to the storage 
provider.  We may have enough "generic interfaces" already to allow 
completely "stand alone" cache managers.
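
As a sketch of what such a split might look like (Python, purely illustrative): the StorageProvider interface, the MemoryProvider, and the MaxEntriesManager below are hypothetical names and do not correspond to existing httpd provider APIs. The only claim is structural: the cache talks to a storage provider and to a manager through narrow interfaces, and the manager does not need to know which storage is in use.

```python
from abc import ABC, abstractmethod


class StorageProvider(ABC):
    """Hypothetical storage interface; not an existing httpd API."""

    @abstractmethod
    def get(self, key):
        """Return the stored value, or None if the key is absent."""

    @abstractmethod
    def put(self, key, value):
        """Store value under key."""

    @abstractmethod
    def delete(self, key):
        """Remove key if present."""

    @abstractmethod
    def keys(self):
        """Return the stored keys, oldest first."""


class MemoryProvider(StorageProvider):
    def __init__(self):
        self._entries = {}  # dicts keep insertion order

    def get(self, key):
        return self._entries.get(key)

    def put(self, key, value):
        self._entries[key] = value

    def delete(self, key):
        self._entries.pop(key, None)

    def keys(self):
        return list(self._entries)


class MaxEntriesManager:
    """Hypothetical manager: evicts the oldest entries beyond a limit."""

    def __init__(self, max_entries):
        self.max_entries = max_entries

    def after_put(self, provider):
        keys = provider.keys()
        for key in keys[: max(0, len(keys) - self.max_entries)]:
            provider.delete(key)


class ProviderCache:
    """The cache itself only forwards to the provider and the manager."""

    def __init__(self, provider, manager):
        self.provider = provider
        self.manager = manager

    def put(self, key, value):
        self.provider.put(key, value)
        self.manager.after_put(self.provider)

    def get(self, key):
        return self.provider.get(key)
```

Swapping MemoryProvider for a disk-backed provider, or the manager for one driven by expiry times, would not require changes to ProviderCache in this arrangement.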

At least, that's how I would do it...


-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies

Re: Possible new cache architecture

Posted by Graham Leggett <mi...@sharp.fm>.

Gonzalo Arana wrote:

> Again, I am in the dark: why would cached request headers need to be
> replaced or edited in the same entity?

It's a requirement of the HTTP/1.1 spec.

HTTP requests can be conditional, in other words a browser (or a proxy) 
can ask a server "give me this URL, but only if it has changed from my 
cached copy".

If the server thinks that the file has changed (or Cache-Control: 
no-cache was specified), then the server will send a full response back 
headers + body, and the browser/proxy replaces its cached copy with the 
new headers+body.

If the server thinks that the file is the same, ie it didn't change, the 
server sends back the magic code "304 Not Modified", and just the 
headers - without any body. These new headers must replace the existing 
headers in the browser/proxy's cached entry, making the cached entry 
"fresh" again. And here lies the problem.

Doing the request this way means you don't have to ask the backend "is 
my cached copy still fresh?", get an answer back "No", and then send a 
second request saying "ok then, give me the new data" - you can 
implement caching in one request.

The catch is when the server sent "304 Not Modified" - you need to 
update your cache to say "yep, my cached entry is still fresh", ie 
update the headers, without touching the body, which hasn't changed.
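
The flow above might be sketched like this in Python. The entry layout (a dict with "headers" and "body") and both function names are assumptions for illustration. On a 304 the sketch merges the fields carried by the response into the stored headers and leaves the body alone; treating the update as a merge rather than a wholesale replacement is my reading of the spec, not something stated in this thread.

```python
def conditional_headers(entry):
    """Build the extra request headers for revalidating a cached entry."""
    headers = {}
    etag = entry["headers"].get("ETag")
    if etag is not None:
        headers["If-None-Match"] = etag
    return headers


def apply_response(entry, status, response_headers, response_body=None):
    """Update a cached entry from the backend's answer."""
    if status == 304:
        # Not Modified: refresh the stored headers, body untouched.
        entry["headers"].update(response_headers)
    elif status == 200:
        # Changed: replace headers and body together.
        entry["headers"] = dict(response_headers)
        entry["body"] = response_body
    return entry
```

One request to the backend is enough in either case: a 304 refreshes the headers, and a 200 delivers the replacement body in the same response.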

> That's why I suggested a dedicated process/thread for cache
> administration, which is not a good idea if too many lookups are
> issued to this process on each request received.

mod_cache definitely needs cache admin, currently it's implemented as an 
external program that is called via cron, which doesn't help if you're 
on a box without cron. Cache cleaning can be done either when a 
connection is complete in the existing process (which may be simpler to 
implement, but it runs after every connection), or it can be done as you 
suggest, where a dedicated thread/process handles this independently.

I think in the long run, a dedicated process is the way to go.

Regards,
Graham
--

Re: Possible new cache architecture

Posted by Gonzalo Arana <go...@gmail.com>.

Thanks for bringing me to the light.

On 5/3/06, Graham Leggett <mi...@sharp.fm> wrote:
> Gonzalo Arana wrote:
>
> > Excuse my ignorance in this matter, but about the 'cache sub-key'
> > issue, why not just use a generic cache (with some expiration model
> > -LRU, perhaps-) with a 'smart' comparison function?
>
> So far one of the best suggestions was from the patch posted recently,
> where the headers and body were in the same file, but where the headers
> were given "breathing room" before the cache body, so that the headers
> can be replaced (within reasonable limits).
> What this means is that each key/data entry is now a single file again
> (like in 1.3), which is much easier to clean up atomically.
>
> The problem still remains that an existing cache file's headers must be
> editable, without doing expensive operations like copying, and this

Again, I am in the dark: why would cached request headers need to be
replaced or edited in the same entity?

> editing must be atomic (no use one thread/process trying to serve
> content from the cache and halfway through, another thread tries to
> update the headers). This will require some form of locking, which may
> be too much of a performance drag, thus blowing the back-to-one-file
> idea out the water.

this makes sense, but I still do not understand the origin of the
problem (in-place header replacement).

> Problems with cache expiry though are a real problem that mod_cache
> suffers from now, and need to be fixed.

That's why I suggested a dedicated process/thread for cache
administration, which is not a good idea if too many lookups are
issued to this process on each request received.

Regards,

--
Gonzalo A. Arana

Re: Possible new cache architecture

Posted by Graham Leggett <mi...@sharp.fm>.

Gonzalo Arana wrote:

> Excuse my ignorance in this matter, but about the 'cache sub-key'
> issue, why not just use a generic cache (with some expiration model
> -LRU, perhaps-) with a 'smart' comparison function?

So far one of the best suggestions was from the patch posted recently, 
where the headers and body were in the same file, but where the headers 
were given "breathing room" before the cache body, so that the headers 
can be replaced (within reasonable limits).

What this means is that each key/data entry is now a single file again 
(like in 1.3), which is much easier to clean up atomically.

The problem still remains that an existing cache file's headers must be 
editable, without doing expensive operations like copying, and this 
editing must be atomic (no use one thread/process trying to serve 
content from the cache and halfway through, another thread tries to 
update the headers). This will require some form of locking, which may 
be too much of a performance drag, thus blowing the back-to-one-file 
idea out the water.
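
To make the "breathing room" layout concrete, here is a small Python sketch. It is not the patch referred to above; the 4096-byte reservation, the NUL padding, and the function names are all assumptions. It shows the headers being overwritten in place without copying the body, and it deliberately does no locking, so a concurrent reader could still see a half-written header block, which is exactly the concern raised here.

```python
HEADER_ROOM = 4096  # assumed fixed region reserved for headers


def pack_headers(headers):
    raw = "".join("%s: %s\r\n" % kv for kv in headers.items()).encode("latin-1")
    if len(raw) > HEADER_ROOM:
        raise ValueError("headers do not fit in the reserved room")
    # Pad with NUL bytes so the body always starts at HEADER_ROOM.
    return raw.ljust(HEADER_ROOM, b"\0")


def write_entry(path, headers, body):
    with open(path, "wb") as f:
        f.write(pack_headers(headers))
        f.write(body)


def replace_headers(path, headers):
    # Overwrite only the reserved region; the body is not copied or moved.
    block = pack_headers(headers)
    with open(path, "r+b") as f:
        f.seek(0)
        f.write(block)


def read_entry(path):
    with open(path, "rb") as f:
        raw = f.read(HEADER_ROOM).rstrip(b"\0")
        body = f.read()
    headers = {}
    for line in raw.decode("latin-1").splitlines():
        name, _, value = line.partition(": ")
        headers[name] = value
    return headers, body
```

Headers that outgrow the reserved room raise an error in this sketch; a real implementation would have to fall back to rewriting the whole entry.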

Problems with cache expiry though are a real problem that mod_cache 
suffers from now, and need to be fixed.

Regards,
Graham
--

Re: Possible new cache architecture

Posted by Gonzalo Arana <go...@gmail.com>.

Excuse my ignorance in this matter, but about the 'cache sub-key'
issue, why not just use a generic cache (with some expiration model
-LRU, perhaps-) with a 'smart' comparison function?

We could use as key full request headers (perhaps somewhat parsed),
and as a comparison function a clever enough code to handle Vary,
entity aging and so on.
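
A sketch of what such a comparison function could look like for Vary alone, in Python. The data layout is an assumption: each URL maps to a list of (stored request headers, stored response headers, body) tuples, with request header names already lower-cased. Entity aging is left out entirely.

```python
def vary_matches(stored_request, stored_response, new_request):
    """True if new_request selects the same variant as stored_request."""
    vary = stored_response.get("vary", "")
    if vary.strip() == "*":
        return False  # "Vary: *" never matches a later request
    for name in (n.strip().lower() for n in vary.split(",")):
        if not name:
            continue
        # Only the headers named by Vary take part in the comparison.
        if stored_request.get(name) != new_request.get(name):
            return False
    return True


def lookup(entries, url, request_headers):
    for stored_request, response_headers, body in entries.get(url, []):
        if vary_matches(stored_request, response_headers, request_headers):
            return body
    return None
```

Headers not listed in Vary (User-Agent in most setups, for instance) are ignored, so two requests that differ only in those headers share one cached variant.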

Best regards,

--
Gonzalo A. Arana

Re: Possible new cache architecture

Posted by Graham Leggett <mi...@sharp.fm>.

Brian Akins wrote:

>> Moving towards and keeping with the above goals is a far higher 
>> priority than simplifying the generic backend cache interface.
> 
> This response was a perfect summation of why we do *not* run the stock 
> mod_cache here...

Having the source means you can customise and improve the code to better 
meet your needs, and in your case your modifications work for you, and 
your organisation has the resources to commission and maintain those 
modifications.

The trouble is, in order to be accepted into httpd, your modifications 
have to work for everyone else as well.

Apparently for example the problem of trying to handle subkeys under a 
main key "is mod_http_cache's problem". Ok, so mod_http_cache now has 
to implement locking mechanisms to try and somehow turn the elegant (but 
overly simplistic) mod_cache into a cache that is practically useful. In 
the process we slow the cache down. The whole point of the cache is to 
speed things up.

Suddenly, we lose the whole point of the exercise.

Regards,
Graham
--

Re: Possible new cache architecture

Posted by Brian Akins <br...@turner.com>.

Graham Leggett wrote:

> Moving towards and keeping with the above goals is a far higher priority 
> than simplifying the generic backend cache interface.

This response was a perfect summation of why we do *not* run the stock 
mod_cache here...

-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies

Re: Generic cache architecture

Posted by "William A. Rowe, Jr." <wr...@rowe-clan.net>.

Gonzalo Arana wrote:
> On 5/3/06, Brian Akins <br...@turner.com> wrote:
> 
>> Is anyone else interested in having a generic cache architecture?  (not
>> http).  I have plenty of cases where I re-invent the wheel for caching
>> various things (IP's, sessions, whatever, etc.).  It would be nice to
>> have a provider based architecture for such things.
> 
> I am. How about adding it to apr?

1. this isn't the dev@apr list, so your inquiry is 1/2 off topic

2. apr isn't a dumping ground

3. however... to the extent that this really portably solves the backend
    storage problem through an array of different providers, that well fits
    into apr-util's mission.  A *vanilla* data store.  The caching mechanics
    of request bodies will always belong in httpd.

Re: Generic cache architecture

Posted by Brian Akins <br...@turner.com>.

Gonzalo Arana wrote:

> I am. How about adding it to apr?

How about someone figuring out how to get providers into apr?  Doesn't 
look horribly hard.  Perhaps I should ask on apr-devel?


-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies

Re: Generic cache architecture

Posted by Gonzalo Arana <go...@gmail.com>.

On 5/3/06, Brian Akins <br...@turner.com> wrote:
> Is anyone else interested in having a generic cache architecture?  (not
> http).  I have plenty of cases where I re-invent the wheel for caching
> various things (IP's, sessions, whatever, etc.).  It would be nice to
> have a provider based architecture for such things.

I am. How about adding it to apr?

Regards,

--
Gonzalo A. Arana

Re: RFC: rename mod_cache to mod_http_cache

Posted by Justin Erenkrantz <ju...@erenkrantz.com>.

On 5/3/06, Paul Querna <ch...@force-elite.com> wrote:
> I am okay with forcing people to wait for 2.4.  Develop in trunk and/or
> devel-branches freely.  Don't worry about back porting it to the stable
> branch, IMO.

+1.  -- justin

Re: Generic cache architecture

Posted by Gonzalo Arana <go...@gmail.com>.

> >
> > Let's talk about httpd.  We have a cache of ssl sessions.  We have
> > a cache
> > of httpd response bodies.  We have a cache of ldap credentials.  A
> > really
> > thorough mod_usertrack would have a cache of user sessions.
> >
> > So really, it doesn't make sense to have these four wheels spinning
> > out of
> > sync at different stages of stability and performance.  I'm
> > strongly +1 to
> > provide this functionality once, and reuse.
>
> On the contrary, it makes no sense whatsoever to use a generic
> storage facility for cached HTTP responses in a front-end cache
> because those responses can only be delivered at maximum speed
> through a single system call IFF they are not generic.  That is
> why our front-end cache is not, and has never needed to be, a
> generic cache.

I have to disagree: indeed a single syscall implies maximum speed &
minimum memcpy (kernel to user, user to kernel), but consider that a
cached response perhaps needs to get compressed (Transfer-Encoding:
gzip, for instance).  So, there is no way to assure that a single
syscall will work.

A generic cache, if designed with proper care, could provide a
file descriptor, which can be used with sendfile(2) or
mmap(2)/write(2)/munmap(2) or any other combination.
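
For illustration only, a Python sketch of that interface. open_cached() is a hypothetical stand-in for a generic-cache call that returns a plain file descriptor; deliver() then hands it to socket.sendfile(), which uses os.sendfile() where the platform supports it and falls back to ordinary send() calls otherwise.

```python
import os
import socket


def open_cached(path):
    # Hypothetical generic-cache call: hand back a plain file descriptor.
    return os.open(path, os.O_RDONLY)


def deliver(fd, conn):
    # The caller decides how to ship the bytes; here the whole file is
    # sent over a blocking stream socket and the byte count is returned.
    with os.fdopen(fd, "rb") as f:
        return conn.sendfile(f)
```

The cache does not need to know whether the caller uses sendfile, mmap, or plain reads; it only has to expose the descriptor.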

> A front-end cache is a completely different beast from a
> back-end cache.  It doesn't make any sense to me to try to

what do you mean by 'front end cache' and 'backend cache'?

> make them the same, and it certainly isn't elegant.  SSL
> session, ldap credentials, sessions, and all those related
> things are trivial memory blocks that *are* suitable for
> back-end caching.

> I have no objection to creating a module for back-end caching.
> I have no objection to creating five different forms of caching
> modules, each with its own qualities, that can be selected by
> configuration (preferably based on some suggested site profile).

perhaps each kind of cache could be used by different parts (SSL
session, ldap credentials, session, would use the 'backend cache'),
and HTTP would use 'front-end cache'.

> However, I see no reason to start by changing the existing
> module names and assuming that one cache fits all.

Regards,

--
Gonzalo A. Arana

Re: Generic cache architecture

Posted by Ruediger Pluem <rp...@apache.org>.


On 05/04/2006 12:35 AM, Justin Erenkrantz wrote:

> 
> For simplicity's sake, I agree.  Let's call this new thing
> mod_cache_generic or mod_frobit.  However, let's not touch mod_cache
> and friends for now.
> 
> We can rearrange things later if this new "architecture" actually has
> any benefits.  I am concerned that overgeneralization is going to make
> things slower.  So, I'd prefer to see us remove code rather than add;
> but to also do it in parallel.  So, I'd like to defer touching
> mod_cache until we know we have something that is concretely better. --

+1. First have a working alternative to the current code and toss the current code
if the new one is better (whatever better means by then exactly).


Regards

Rüdiger

Re: Generic cache architecture

Posted by Justin Erenkrantz <ju...@erenkrantz.com>.

On 5/3/06, Roy T. Fielding <fi...@gbiv.com> wrote:
> However, I see no reason to start by changing the existing
> module names and assuming that one cache fits all.

For simplicity's sake, I agree.  Let's call this new thing
mod_cache_generic or mod_frobit.  However, let's not touch mod_cache
and friends for now.

We can rearrange things later if this new "architecture" actually has
any benefits.  I am concerned that overgeneralization is going to make
things slower.  So, I'd prefer to see us remove code rather than add;
but to also do it in parallel.  So, I'd like to defer touching
mod_cache until we know we have something that is concretely better. 
-- justin

Re: Generic cache architecture

Posted by Nick Kew <ni...@webthing.com>.

On Wednesday 03 May 2006 20:44, Brian Akins wrote:
> Is anyone else interested in having a generic cache architecture?  (not
> http).  I have plenty of cases where I re-invent the wheel for caching
> various things (IP's, sessions, whatever, etc.).  It would be nice to
> have a provider based architecture for such things.

Yes, I think at that basic level, your proposal is uncontroversial.
I'd like to be able to plug in alternative cacheing modules without
having to reimplement the whole thing.

I would point out there's a big "grey area" here, with regimes
such as ESI cacheing that are bastardised HTTP.  mod_cache_http
is (modulo any bugs) technically accurate for the current cache
module, but calling it mod_cache_rfc2616 might be less confusing
for users of other cacheing regimes that purport to be HTTP
(like ESI), or that run *on top of* HTTP (like some XML-based
monstrosities).

-- 
Nick Kew

Re: Generic cache architecture

Posted by Brian Akins <br...@turner.com>.

Roy T. Fielding wrote:
>> provide this functionality once, and reuse.
> On the contrary, it makes no sense whatsoever to use a generic
> storage facility for cached HTTP responses in a front-end cache
> because those responses can only be delivered at maximum speed
> through a single system call IFF they are not generic.  That is
> why our front-end cache is not, and has never needed to be, a
> generic cache.

A generic cache can deliver objects in a single system call.  Think
VFS.  The "generic storage facility" may be only a thin wrapper around
something like the current mod_disk_cache, or it may be a memcache frontend,
or something completely different.

Trust me, I am extremely concerned about performance.

-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies

Re: Generic cache architecture

Posted by "William A. Rowe, Jr." <wr...@rowe-clan.net>.

Roy T. Fielding wrote:
> 
> A front-end cache is a completely different beast from a
> back-end cache.  It doesn't make any sense to me to try to
> make them the same, and it certainly isn't elegant.  SSL
> session, ldap credentials, sessions, and all those related
> things are trivial memory blocks that *are* suitable for
> back-end caching.

/nod - I can appreciate the distinction.  That said, we will be better off
grouping front end cache providers together, with one solid reference
implementation (play your optimization games in experimental providers),
and likewise a framework for the back end cache providers with one solid
reference implementation.  I suspect httpd gets a whole lot simpler
and more stable with these available to third party authors.

Bill

Re: Generic cache architecture

Posted by "Roy T. Fielding" <fi...@gbiv.com>.

On May 3, 2006, at 12:53 PM, William A. Rowe, Jr. wrote:

> Brian Akins wrote:
>> Is anyone else interested in having a generic cache architecture?   
>> (not http).  I have plenty of cases where I re-invent the wheel for
>> caching various things (IP's, sessions, whatever, etc.).  It would  
>> be nice to have a provider based architecture for such things.
>
> Let's talk about httpd.  We have a cache of ssl sessions.  We have  
> a cache
> of httpd response bodies.  We have a cache of ldap credentials.  A  
> really
> thorough mod_usertrack would have a cache of user sessions.
>
> So really, it doesn't make sense to have these four wheels spinning  
> out of
> sync at different stages of stability and performance.  I'm  
> strongly +1 to
> provide this functionality once, and reuse.

On the contrary, it makes no sense whatsoever to use a generic
storage facility for cached HTTP responses in a front-end cache
because those responses can only be delivered at maximum speed
through a single system call IFF they are not generic.  That is
why our front-end cache is not, and has never needed to be, a
generic cache.

A front-end cache is a completely different beast from a
back-end cache.  It doesn't make any sense to me to try to
make them the same, and it certainly isn't elegant.  SSL
session, ldap credentials, sessions, and all those related
things are trivial memory blocks that *are* suitable for
back-end caching.

I have no objection to creating a module for back-end caching.
I have no objection to creating five different forms of caching
modules, each with its own qualities, that can be selected by
configuration (preferably based on some suggested site profile).
However, I see no reason to start by changing the existing
module names and assuming that one cache fits all.

....Roy

Re: Generic cache architecture

Posted by "William A. Rowe, Jr." <wr...@rowe-clan.net>.

Ruediger Pluem wrote:
> 
> On 05/03/2006 11:27 PM, William A. Rowe, Jr. wrote:
>>Moreso, we need more third party authors to -participate- in telling us what
>>in HTTPD-2.4 will make their module better.  And a faster cycle of 6mos-1yr
>>gives them a chance to do this and realize the benefits in the official
>>release more quickly.
> 
> But some of them will fall off the shelf (I guess especially some commercial
> ones), because they do not want to invest time again into a changing API.

Of course.  Why stall all efforts because some don't move with new technologies?

> This means that they stick with an older version and do not do releases for
> each major version, but, let's say, for every second one.

Ok, their users' loss, not an httpd development issue.

Let's be clear here, this isn't a statement against making their lives easier
to actually do the ports - we could be a heck of a lot more helpful in module
authoring and especially in porting documentation.

> Furthermore having frequent major releases increases the backport requests from
> user side and thus likely the backport efforts, as some users stick with older
> versions for whatever reasons (e.g. third party modules).

Wrong.  We honor fewer backport requests for features.  We might entertain some
for bug fixes certainly.  But the answer's always upgrade asap for fixes, and
an upgrade's mandatory for new features.

Backporting features is a vicious cycle.  No problem here with folks using
Apache 1.3 when it solves their pain.  But if they want cool feature X and
we give it to them, we support 1.3 for that much longer because this-bug
and that-bug ought to get fixed, and new feature X has a bug so we have
a subsequent release, followed by 12 more pleas for other features to be
backported.

Nip it in the bud, freeze the features in old version, and poof, people move
because they *want* the new features, it solves more of their pain, and so they
have an incentive to make *their* investment of time in migrating.  Take away
the incentive and they will not (heck, should not) migrate up to our supported
version.

> Maybe we should have a FDT (Frequently Discussed Topics) to collect the arguments ;-).

Hehe.

Re: Generic cache architecture

Posted by Ruediger Pluem <rp...@apache.org>.

On 05/03/2006 11:27 PM, William A. Rowe, Jr. wrote:

> 
> Moreso, we need more third party authors to -participate- in telling us
> what
> in HTTPD-2.4 will make their module better.  And a faster cycle of 6mos-1yr
> gives them a chance to do this and realize the benefits in the official
> release more quickly.

But some of them will fall off the shelf (I guess especially some commercial
ones), because they do not want to invest time again into a changing API.
This means that they stick with an older version and do not do releases for
each major version, but, let's say, for every second one.
Furthermore having frequent major releases increases the backport requests from
user side and thus likely the backport efforts, as some users stick with older
versions for whatever reasons (e.g. third party modules).
I know we had this discussion about major release cycles several times in the past.
Both approaches have pros and cons. So each side can dig out its old arguments / mails
as I do not think that anything essentially new has been added to the pros and cons
since last time :-).
Maybe we should have a FDT (Frequently Discussed Topics) to collect the arguments ;-).

> 
> Note that 2.2 will be a year old by December, so even if this concerns you,
> we are already half way there.

One for you. In 5 months httpd 2.2 is not far away from its first birthday.
I hate infant mortality :-).

Regards

Rüdiger

Re: Generic cache architecture

Posted by "William A. Rowe, Jr." <wr...@rowe-clan.net>.

Graham Leggett wrote:
> Ruediger Pluem wrote:
> 
>> Please keep in mind that some of us are dependent on commercial httpd 
>> modules,
>> whether we like it or not.
>> If the major upgrades happen in cycles shorter than a year I guess it is
>> hard to get the commercial vendors to provide them. Not everybody is as
>> innovative and fast as the ASF :-).
> 
> +1.

-0.  You forget that we were frequently breaking the API way-way-back-when,
and the good vendors kept up, and the lousy ones didn't.

Moreso, we need more third party authors to -participate- in telling us what
in HTTPD-2.4 will make their module better.  And a faster cycle of 6mos-1yr
gives them a chance to do this and realize the benefits in the official
release more quickly.

Note that 2.2 will be a year old by December, so even if this concerns you,
we are already half way there.

Bill

Re: Generic cache architecture

Posted by Graham Leggett <mi...@sharp.fm>.

Ruediger Pluem wrote:

> Please keep in mind that some of us are dependent on commercial httpd modules,
> whether we like it or not.
> If the major upgrades happen in cycles shorter than a year I guess it is
> hard to get the commercial vendors to provide them. Not everybody is as
> innovative and fast as the ASF :-).

+1.

Regards,
Graham
--

Re: Generic cache architecture

Posted by Ruediger Pluem <rp...@apache.org>.

On 05/03/2006 09:53 PM, William A. Rowe, Jr. wrote:

> 
> And finally (most important) none of this needs to target 2.2.  If 2.2
> lives
> 5 months to be replaced by 2.4 - there is really no issue.  2.0 lived

Please keep in mind that some of us are dependent on commercial httpd modules,
whether we like it or not.
If the major upgrades happen in cycles shorter than a year I guess it is
hard to get the commercial vendors to provide them. Not everybody is as
innovative and fast as the ASF :-).

Regards

Rüdiger

Re: Generic cache architecture

Posted by "William A. Rowe, Jr." <wr...@rowe-clan.net>.

Brian Akins wrote:
> Is anyone else interested in having a generic cache architecture?  (not 
> http).  I have plenty of cases where I re-invent the wheel for caching
> various things (IP's, sessions, whatever, etc.).  It would be nice to 
> have a provider based architecture for such things.

Let's talk about httpd.  We have a cache of ssl sessions.  We have a cache
of httpd response bodies.  We have a cache of ldap credentials.  A really
thorough mod_usertrack would have a cache of user sessions.

So really, it doesn't make sense to have these four wheels spinning out of
sync at different stages of stability and performance.  I'm strongly +1 to
provide this functionality once, and reuse.
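
The "provide once, and reuse" idea can be pictured as a small provider
registry.  What follows is only a rough sketch under stated assumptions:
the names cache_provider, cache_register_provider, cache_lookup_provider,
mem_put and mem_get are hypothetical and are not httpd's actual API, and
the toy in-memory table stands in for whatever real storage a provider
would wrap.  Each consumer (SSL sessions, LDAP credentials, user sessions)
would look the provider up by name and prefix its own keys.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical provider interface; not httpd's actual API. */
typedef struct {
    const char *name;
    int (*put)(void *ctx, const char *key, const void *data, size_t len);
    const void *(*get)(void *ctx, const char *key, size_t *len);
    void *ctx;                      /* provider-private state */
} cache_provider;

#define MAX_PROVIDERS 8
static const cache_provider *registry[MAX_PROVIDERS];
static int registry_count;

/* Register a provider under its name; returns 0, or -1 if the table is full. */
int cache_register_provider(const cache_provider *p)
{
    assert(p && p->name && p->put && p->get);
    if (registry_count >= MAX_PROVIDERS)
        return -1;
    registry[registry_count++] = p;
    return 0;
}

/* Look a provider up by name; returns NULL if none was registered. */
const cache_provider *cache_lookup_provider(const char *name)
{
    for (int i = 0; i < registry_count; i++)
        if (strcmp(registry[i]->name, name) == 0)
            return registry[i];
    return NULL;
}

/* Toy in-memory provider: a small fixed table, linear search. */
#define MEM_SLOTS 8
typedef struct {
    char key[48];
    unsigned char data[64];
    size_t len;
    int used;
} mem_slot;

typedef struct {
    mem_slot slots[MEM_SLOTS];
} mem_ctx;

int mem_put(void *ctx, const char *key, const void *data, size_t len)
{
    mem_ctx *m = ctx;
    mem_slot *target = NULL;
    if (strlen(key) >= sizeof m->slots[0].key || len > sizeof m->slots[0].data)
        return -1;
    for (int i = 0; i < MEM_SLOTS; i++) {
        mem_slot *s = &m->slots[i];
        if (s->used && strcmp(s->key, key) == 0) {
            target = s;             /* overwrite an existing entry */
            break;
        }
        if (!s->used && !target)
            target = s;             /* remember the first free slot */
    }
    if (!target)
        return -1;
    strcpy(target->key, key);
    memcpy(target->data, data, len);
    target->len = len;
    target->used = 1;
    return 0;
}

const void *mem_get(void *ctx, const char *key, size_t *len)
{
    mem_ctx *m = ctx;
    for (int i = 0; i < MEM_SLOTS; i++) {
        mem_slot *s = &m->slots[i];
        if (s->used && strcmp(s->key, key) == 0) {
            *len = s->len;
            return s->data;
        }
    }
    return NULL;
}
```

A consumer would zero a mem_ctx, register { "mem", mem_put, mem_get, &ctx },
and then store entries under keys such as "ssl:sess1" or "ldap:uid=bob";
the key prefixes are likewise just an illustration.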

While we are at it, the proxy backend requester should be generic enough that
if I need to fetch, say, a trusted CA reference from a backend (dmz) server,
that code shouldn't be rewritten either.  But it's not oriented to the request
so it would be good to see some more modularity on the proxy backend while we
improve the cache middle layer.

And finally (most important) none of this needs to target 2.2.  If 2.2 lives
5 months to be replaced by 2.4 - there is really no issue.  2.0 lived too long
because 2.1.x stayed in flux too long.

Bill

Generic cache architecture

Posted by Brian Akins <br...@turner.com>.

Is anyone else interested in having a generic cache architecture?  (not 
http).  I have plenty of cases where I re-invent the wheel for caching
various things (IP's, sessions, whatever, etc.).  It would be nice to 
have a provider based architecture for such things.



-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies

Re: RFC: rename mod_cache to mod_http_cache

Posted by Brian Akins <br...@turner.com>.

William A. Rowe, Jr. wrote:
> Not in 2.2 branch, but in trunk?  The issue is that it's half httpd, and
> half generic.  Let me mull this over.

can we separate out the http specific parts without violating Graham's 
concerns?  My whole original idea was to just do that... I was not fully 
aware of the issues in the current mod_cache.

-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies

Re: RFC: rename mod_cache to mod_http_cache

Posted by "William A. Rowe, Jr." <wr...@rowe-clan.net>.

Brian Akins wrote:
> Not wanting to stir the huge pot o' stuff that is going on here, but 
> what are the thoughts of renaming mod_cache to mod_http_cache? mod_cache 
> is http specific.  This would follow the general idea that mod_proxy uses.
> 
> I am not suggesting changing any functionality at this time, simply 
> renaming it to a more suitable name.

Not in 2.2 branch, but in trunk?  The issue is that it's half httpd, and
half generic.  Let me mull this over.

If we layer it, I can entirely agree that there should be a mod_http_cache
that's entirely concerned with the content negotiation handshake of http.
But in all other respects, mod_cache is equally useful for other protocols
such as mod_ftp (it takes advantage of it now, only with one possible
variant because ftp doesn't speak in variants.)

Bill

Re: RFC: rename mod_cache to mod_http_cache

Posted by Paul Querna <ch...@force-elite.com>.

Graham Leggett wrote:
> Brian Akins wrote:
> 
>> Not wanting to stir the huge pot o' stuff that is going on here, but
>> what are the thoughts of renaming mod_cache to mod_http_cache?
>> mod_cache is http specific.  This would follow the general idea that
>> mod_proxy uses.
> 
> This is a good idea, but thinking about this for a bit, doing so would
> be impossible to backport to v2.2 (it would break existing configs).
> This in turn would make it more difficult for fixes that would be useful
> in v2.2 to be backported, forcing people to wait until v2.4 before
> seeing the advantages.

I am okay with forcing people to wait for 2.4.  Develop in trunk and/or
devel-branches freely.  Don't worry about back porting it to the stable
branch, IMO.

-Paul

Re: RFC: rename mod_cache to mod_http_cache

Posted by Graham Leggett <mi...@sharp.fm>.

Brian Akins wrote:

> Not wanting to stir the huge pot o' stuff that is going on here, but 
> what are the thoughts of renaming mod_cache to mod_http_cache? mod_cache 
> is http specific.  This would follow the general idea that mod_proxy uses.

This is a good idea, but thinking about this for a bit, doing so would 
be impossible to backport to v2.2 (it would break existing configs). 
This in turn would make it more difficult for fixes that would be useful 
in v2.2 to be backported, forcing people to wait until v2.4 before 
seeing the advantages.

I agree that a rename should definitely happen though, but I'd say not yet.

Regards,
Graham
--

RFC: rename mod_cache to mod_http_cache

Posted by Brian Akins <br...@turner.com>.

Not wanting to stir the huge pot o' stuff that is going on here, but 
what are the thoughts of renaming mod_cache to mod_http_cache? 
mod_cache is http specific.  This would follow the general idea that
mod_proxy uses.

I am not suggesting changing any functionality at this time, simply 
renaming it to a more suitable name.

-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies

Re: Possible new cache architecture

Posted by Brian Akins <br...@turner.com>.

Roy T. Fielding wrote:

> For the record, Graham's statements were entirely correct,
> Brian's suggested architecture would slow the HTTP cache,

No. It would simplify the existing implementation.  The existing 
implementation, as Graham has noted, is not "fully functional."  Graham 
argues - and I'm still mulling it over - that a generic cache 
architecture would get in the way of making a fully functional http cache.

-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies

Re: Possible new cache architecture

Posted by Davi Arnaut <da...@haxent.com.br>.

On Wed, 3 May 2006 11:39:02 -0700
"Roy T. Fielding" <fi...@gbiv.com> wrote:

> On May 3, 2006, at 5:56 AM, Davi Arnaut wrote:
> 
> > On Wed, 3 May 2006 14:31:06 +0200 (SAST)
> > "Graham Leggett" <mi...@sharp.fm> wrote:
> >
> >> On Wed, May 3, 2006 1:26 am, Davi Arnaut said:
> >>
> >>>> Then you will end up with code that does not meet the  
> >>>> requirements of
> >>>> HTTP, and you will have wasted your time.
> >>>
> >>> Yeah, right! How ? Hey, you are using the Monty Python argument  
> >>> style.
> >>> Can you point to even one requirement of HTTP that my_cache_provider
> >>> won't meet ?
> >>
> >> Yes. Atomic insertions and deletions, the ability to update headers
> >> independently of body, etc etc, just go back and read the thread.
> >
> > I can't argue with a zombie, you keep repeating the same
> > misunderstandings.
> >
> >> Seriously, please move this off list to keep the noise out of  
> >> people's
> >> inboxes.
> >
> > Fine, I give up.
> 
> For the record, Graham's statements were entirely correct,
> Brian's suggested architecture would slow the HTTP cache,
> and your responses have been amazingly childish for someone
> who has earned zero credibility on this list.

Fine, I do have zero credibility.

> I suggest you stop defending a half-baked design theory and
> just go ahead and implement something as a patch.  If it works,
> that's great.  If it slows the HTTP cache, I will veto it myself.

I'm already doing this.

> There is, of course, no reason why the HTTP cache has to use
> some new middle-layer back-end cache, so maybe you could just
> stop arguing about vaporware and simply implement a single
> mod_backend_cache that doesn't try to be all things to all people.
> 
> Implement it and then convince people on the basis of measurements.
> That is a heck of a lot easier than convincing everyone to dump
> the current code based on an untested theory.
> 

I just wanted to get comments (the original idea wasn't mine).

It wasn't my intention to flame anyone, I'm not mad or anything.
I was just stating my opinion. I may be wrong, but I don't give
up easily. :)

--
Davi Arnaut

Re: Possible new cache architecture

Posted by Brian Akins <br...@turner.com>.

Roy T. Fielding wrote:
> That is a heck of a lot easier than convincing everyone to dump
> the current code based on an untested theory.

I think the idea may be a lot more tested than you think.  Most things I 
"suggest" have had an incubation period somewhere...

I'm fine with not screwing with current mod_cache.  I just think it 
should be either: renamed or made generic.  We may or may not need a 
generic mod_backend_cache.  I have posted a "pseudo-implementation" that
got lost in the latest thread bloat.  I can repost if anyone is interested.

-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies

Re: Possible new cache architecture

Posted by "Roy T. Fielding" <fi...@gbiv.com>.

On May 3, 2006, at 5:56 AM, Davi Arnaut wrote:

> On Wed, 3 May 2006 14:31:06 +0200 (SAST)
> "Graham Leggett" <mi...@sharp.fm> wrote:
>
>> On Wed, May 3, 2006 1:26 am, Davi Arnaut said:
>>
>>>> Then you will end up with code that does not meet the  
>>>> requirements of
>>>> HTTP, and you will have wasted your time.
>>>
>>> Yeah, right! How ? Hey, you are using the Monty Python argument  
>>> style.
>>> Can you point to even one requirement of HTTP that my_cache_provider
>>> won't meet ?
>>
>> Yes. Atomic insertions and deletions, the ability to update headers
>> independently of body, etc etc, just go back and read the thread.
>
> I can't argue with a zombie, you keep repeating the same
> misunderstandings.
>
>> Seriously, please move this off list to keep the noise out of  
>> people's
>> inboxes.
>
> Fine, I give up.

For the record, Graham's statements were entirely correct,
Brian's suggested architecture would slow the HTTP cache,
and your responses have been amazingly childish for someone
who has earned zero credibility on this list.

I suggest you stop defending a half-baked design theory and
just go ahead and implement something as a patch.  If it works,
that's great.  If it slows the HTTP cache, I will veto it myself.

There is, of course, no reason why the HTTP cache has to use
some new middle-layer back-end cache, so maybe you could just
stop arguing about vaporware and simply implement a single
mod_backend_cache that doesn't try to be all things to all people.

Implement it and then convince people on the basis of measurements.
That is a heck of a lot easier than convincing everyone to dump
the current code based on an untested theory.

....Roy

Re: Possible new cache architecture

Posted by Davi Arnaut <da...@haxent.com.br>.

On Wed, 3 May 2006 14:31:06 +0200 (SAST)
"Graham Leggett" <mi...@sharp.fm> wrote:

> On Wed, May 3, 2006 1:26 am, Davi Arnaut said:
> 
> >> Then you will end up with code that does not meet the requirements of
> >> HTTP, and you will have wasted your time.
> >
> > Yeah, right! How ? Hey, you are using the Monty Python argument style.
> > Can you point to even one requirement of HTTP that my_cache_provider
> > won't meet ?
> 
> Yes. Atomic insertions and deletions, the ability to update headers
> independently of body, etc etc, just go back and read the thread.

I can't argue with a zombie, you keep repeating the same misunderstandings.

> Seriously, please move this off list to keep the noise out of people's
> inboxes.

Fine, I give up.

--
Davi Arnaut

Re: Possible new cache architecture

Posted by Graham Leggett <mi...@sharp.fm>.

William A. Rowe, Jr. wrote:

> --1.  This is a development list.  If you don't want development 
> discussions,
> don't subscribe.

I was referring to the flamebait, development discussions would 
obviously remain on the list.

Regards,
Graham
--

Re: Possible new cache architecture

Posted by "William A. Rowe, Jr." <wr...@rowe-clan.net>.

Graham Leggett wrote:
> 
> Seriously, please move this off list to keep the noise out of people's
> inboxes.

--1.  This is a development list.  If you don't want development discussions,
don't subscribe.

Bill

Re: Possible new cache architecture

Posted by Graham Leggett <mi...@sharp.fm>.

Brian Akins wrote:

> Does this discussion belong off-list?  I would think this is the type of 
> thing we need to discuss on this list.

The technical discussion belongs on the list, flames not.

> Is there any consensus as to how to move forward?  Do we just leave it 
> as it is currently?

There is a patch on the table, let's review it.

Regards,
Graham
--

Re: Possible new cache architecture

Posted by Brian Akins <br...@turner.com>.

Graham Leggett wrote:

> Seriously, please move this off list to keep the noise out of people's
> inboxes.

Does this discussion belong off-list?  I would think this is the type of 
thing we need to discuss on this list.

Is there any consensus as to how to move forward?  Do we just leave it 
as it is currently?

-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies

Re: Possible new cache architecture

Posted by Graham Leggett <mi...@sharp.fm>.

On Wed, May 3, 2006 1:26 am, Davi Arnaut said:

>> Then you will end up with code that does not meet the requirements of
>> HTTP, and you will have wasted your time.
>
> Yeah, right! How ? Hey, you are using the Monty Python argument style.
> Can you point to even one requirement of HTTP that my_cache_provider
> won't meet ?

Yes. Atomic insertions and deletions, the ability to update headers
independently of body, etc etc, just go back and read the thread.

Seriously, please move this off list to keep the noise out of people's
inboxes.

Regards,
Graham
--

Re: Possible new cache architecture

Posted by Davi Arnaut <da...@haxent.com.br>.

On Wed, 03 May 2006 01:09:03 +0200
Graham Leggett <mi...@sharp.fm> wrote:

> Davi Arnaut wrote:
> 
> > Graham, what I want is to be able to write a mod_cache backend _without_
> > having to worry about HTTP.
> 
> Then you will end up with code that does not meet the requirements of 
> HTTP, and you will have wasted your time.

Yeah, right! How ? Hey, you are using the Monty Python argument style.
Can you point to even one requirement of HTTP that my_cache_provider
won't meet ?

> Please go through _all_ of the mod_cache architecture, and not just 
> mod_disk_cache. Also read and understand HTTP/1.1 gateways and caches, 
> and as you want to create a generic cache, read and understand mod_ldap, 
> a module that will probably benefit from the availability of a generic 
> cache. Then step back and see that mod_cache is a small part of a bigger 
> picture. At this point you'll see that as nice as your idea of a simple 
> generic cache interface is, it's not going to be the most elegant 
> solution to the problem.

blah, blah.. you essentially said: "I don't want a simpler interface,
I think the current mess is more elegant."

I have shown you that I can even wrap your messy cache_provider hooks
into a much simpler one, how can anything else be more elegant ?

--
Davi Arnaut

Re: Possible new cache architecture

Posted by Graham Leggett <mi...@sharp.fm>.

Davi Arnaut wrote:

> Graham, what I want is to be able to write a mod_cache backend _without_
> having to worry about HTTP.

Then you will end up with code that does not meet the requirements of 
HTTP, and you will have wasted your time.

Please go through _all_ of the mod_cache architecture, and not just 
mod_disk_cache. Also read and understand HTTP/1.1 gateways and caches, 
and as you want to create a generic cache, read and understand mod_ldap, 
a module that will probably benefit from the availability of a generic 
cache. Then step back and see that mod_cache is a small part of a bigger 
picture. At this point you'll see that as nice as your idea of a simple 
generic cache interface is, it's not going to be the most elegant 
solution to the problem.

Regards,
Graham
--

Re: Possible new cache architecture

Posted by Davi Arnaut <da...@haxent.com.br>.

On Tue, 02 May 2006 23:31:13 +0200
Graham Leggett <mi...@sharp.fm> wrote:

> Davi Arnaut wrote:
> 
> >> The way HTTP caching works is a lot more complex than in your example, you
> >> haven't taken into account conditional HTTP requests.
> > 
> > I've taken into account the actual mod_disk_cache code!
> 
> mod_disk_cache doesn't contain any of the conditional HTTP request code, 
> which is why you're not seeing it there.
> 
> Please keep in mind that the existing mod_cache framework's goal is to 
> be a fully HTTP/1.1 compliant, content generator neutral, efficient, 
> error free and high performance cache.
> 
> Moving towards and keeping with the above goals is a far higher priority 
> than simplifying the generic backend cache interface.
> 
> To sum up - the cache backend must fulfill the requirements of the cache 
> frontend (generic or not), which in turn must fulfill the requirements 
> of the users, who are browsers, web robot code, and humans. To try and 
> prioritise this the other way round is putting the cart before the horse.

Graham, what I want is to be able to write a mod_cache backend _without_
having to worry about HTTP. _NOT_ to rewrite mod_disk/proxy/cache/whatever!

You keep talking about HTTP this, HTTP that, I won't change the way it currently
works. I just want to place a glue between the storage and the HTTP part.

I could even wrap around your code:

typedef struct {
	/* each call names the entry it acts on by key */
	apr_status_t (*fetch) (cache_handle_t *h, const char *key, apr_bucket_brigade *bb);
	apr_status_t (*store) (cache_handle_t *h, const char *key, apr_bucket_brigade *bb);
	int (*remove) (const char *key);
} my_cache_provider;

typedef struct {
	const char *key_headers;
	const char *key_body;
} my_cache_object;

create_entity:
	my_cache_object *obj;

	obj->key_headers = hash_headers(request, whatever);
	obj->key_body = hash_body(request, whatever);

open_entity:
	my_cache_object *obj;

	my_provider->fetch(h, obj->key_headers, header_brigade);

	// if necessary, update obj->key_headers/body (vary..)


remove_url:
	my_provider->remove(obj->key_headers);
	my_provider->remove(obj->key_body);

remove_entity:
	nop

store_headers:
	my_cache_object *obj;
	// if necessary, update obj->key_headers (vary..)
	my_provider->store(h, obj->key_headers, header_brigade);

store_body:
	my_cache_object *obj;
	my_provider->store(h, obj->key_body, body_brigade);

recall_headers:
	my_cache_object *obj;
	my_provider->fetch(h, obj->key_headers, header_brigade);

recall_body:	
	my_cache_object *obj;
	my_provider->fetch(h, obj->key_body, body_brigade);
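
The outline above can be turned into something that compiles.  This is a
minimal sketch, assuming a toy in-memory key/value table in place of a real
storage provider and plain C strings in place of bucket brigades.  All names
here (kv_store, kv_put, http_store and so on) are hypothetical, as are the
".header" and ".data" key suffixes.  The generic store knows nothing about
HTTP; the HTTP glue derives two keys per URL and treats an object as cached
only when both halves are present.  Writing the body before the headers is
an ordering convention for this single-threaded sketch, not an atomicity
guarantee.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define KV_SLOTS  16
#define KV_KEYLEN 64
#define KV_VALLEN 256

/* Generic store: keys and opaque strings only, no HTTP knowledge. */
typedef struct {
    char key[KV_KEYLEN];
    char val[KV_VALLEN];
    int used;
} kv_slot;

typedef struct {
    kv_slot slots[KV_SLOTS];
} kv_store;

static kv_slot *kv_find(kv_store *s, const char *key)
{
    for (int i = 0; i < KV_SLOTS; i++)
        if (s->slots[i].used && strcmp(s->slots[i].key, key) == 0)
            return &s->slots[i];
    return NULL;
}

/* Returns 0 on success, -1 if the input is too long or the store is full. */
int kv_put(kv_store *s, const char *key, const char *val)
{
    assert(s && key && val);
    if (strlen(key) >= KV_KEYLEN || strlen(val) >= KV_VALLEN)
        return -1;
    kv_slot *slot = kv_find(s, key);
    for (int i = 0; i < KV_SLOTS && !slot; i++)
        if (!s->slots[i].used)
            slot = &s->slots[i];
    if (!slot)
        return -1;
    strcpy(slot->key, key);
    strcpy(slot->val, val);
    slot->used = 1;
    return 0;
}

const char *kv_get(kv_store *s, const char *key)
{
    kv_slot *slot = kv_find(s, key);
    return slot ? slot->val : NULL;
}

void kv_del(kv_store *s, const char *key)
{
    kv_slot *slot = kv_find(s, key);
    if (slot)
        slot->used = 0;
}

/* HTTP-side glue: one URL maps to two independent generic entries. */
static int http_keys(const char *url, char *hkey, char *bkey, size_t n)
{
    int a = snprintf(hkey, n, "%s.header", url);
    int b = snprintf(bkey, n, "%s.data", url);
    return (a < 0 || b < 0 || (size_t)a >= n || (size_t)b >= n) ? -1 : 0;
}

int http_store(kv_store *s, const char *url, const char *headers, const char *body)
{
    char hkey[KV_KEYLEN], bkey[KV_KEYLEN];
    if (http_keys(url, hkey, bkey, sizeof hkey) != 0)
        return -1;
    /* Body first, headers last, so stored headers imply a stored body. */
    if (kv_put(s, bkey, body) != 0)
        return -1;
    if (kv_put(s, hkey, headers) != 0) {
        kv_del(s, bkey);
        return -1;
    }
    return 0;
}

/* Returns 1 on a hit; an object counts only if both halves are present. */
int http_fetch(kv_store *s, const char *url, const char **headers, const char **body)
{
    char hkey[KV_KEYLEN], bkey[KV_KEYLEN];
    if (http_keys(url, hkey, bkey, sizeof hkey) != 0)
        return 0;
    *headers = kv_get(s, hkey);
    *body = kv_get(s, bkey);
    return (*headers && *body) ? 1 : 0;
}

void http_remove(kv_store *s, const char *url)
{
    char hkey[KV_KEYLEN], bkey[KV_KEYLEN];
    if (http_keys(url, hkey, bkey, sizeof hkey) != 0)
        return;
    kv_del(s, hkey);
    kv_del(s, bkey);
}
```

If the store drops "/foo.data" on its own, http_fetch reports a miss even
though "/foo.header" is still there, which is the "valid only if both are
valid" rule discussed earlier in the thread.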

--
Davi Arnaut

Re: Possible new cache architecture

Posted by Graham Leggett <mi...@sharp.fm>.

Davi Arnaut wrote:

>> The way HTTP caching works is a lot more complex than in your example, you
>> haven't taken into account conditional HTTP requests.
> 
> I've taken into account the actual mod_disk_cache code!

mod_disk_cache doesn't contain any of the conditional HTTP request code, 
which is why you're not seeing it there.

Please keep in mind that the existing mod_cache framework's goal is to 
be a fully HTTP/1.1 compliant, content generator neutral, efficient, 
error free and high performance cache.

Moving towards and keeping with the above goals is a far higher priority 
than simplifying the generic backend cache interface.

To sum up - the cache backend must fulfill the requirements of the cache 
frontend (generic or not), which in turn must fulfill the requirements 
of the users, who are browsers, web robot code, and humans. To try and 
prioritise this the other way round is putting the cart before the horse.

Regards,
Graham
--

Re: Possible new cache architecture

Posted by Davi Arnaut <da...@haxent.com.br>.

On Tue, 2 May 2006 17:22:00 +0200 (SAST)
"Graham Leggett" <mi...@sharp.fm> wrote:

> On Tue, May 2, 2006 7:06 pm, Davi Arnaut said:
> 
> > There is no such scenario. I will simulate a request using the disk_cache
> > format:
> 
> The way HTTP caching works is a lot more complex than in your example, you
> haven't taken into account conditional HTTP requests.

I've taken into account the actual mod_disk_cache code! Let me try to translate
your typical scenario.

> A typical conditional scenario goes like this:
> 
> - Browser asks for URL from httpd.

Same.

> - Mod_cache has a cached copy by looking up the headers BUT - it's stale.
> mod_cache converts the browser's original request to a conditional request
> by adding the header If-None-Match.

sed s/mod_cache/mod_http_cache

> - The backend server answers "no worries, what you have is still fresh" by
> sending a "304 Not Modified".

sed s/mod_cache/mod_http_cache

> - mod_cache takes the headers from the 304, and replaces the headers on
> the cached entry, in the process making the entry "fresh" again.

sed s/mod_cache/mod_http_cache

> - mod_cache hands the cached data back to the browser.

sed s/mod_cache/mod_http_cache

> Read http://www.ietf.org/rfc/rfc2616.txt section 13 (mainly) to see in
> detail how this works.

Again: we do not want to change the semantics, we only want to separate
the HTTP specific part from the storage specific part. The HTTP specific
parts of mod_disk_cache, mod_mem_cache and mod_cache are moved to a
mod_http_cache, while retaining the storage specific parts. And mod_cache
is the one who will combine those two layers.

Again: it's the same thing as if we were replacing all mod_disk_cache file
operations by hash table operations.

--
Davi Arnaut

Re: Possible new cache architecture

Posted by Graham Leggett <mi...@sharp.fm>.

On Tue, May 2, 2006 7:06 pm, Davi Arnaut said:

> There is no such scenario. I will simulate a request using the disk_cache
> format:

The way HTTP caching works is a lot more complex than in your example, you
haven't taken into account conditional HTTP requests.

A typical conditional scenario goes like this:

- Browser asks for URL from httpd.

- Mod_cache has a cached copy by looking up the headers BUT - it's stale.
mod_cache converts the browser's original request to a conditional request
by adding the header If-None-Match.

- The backend server answers "no worries, what you have is still fresh" by
sending a "304 Not Modified".

- mod_cache takes the headers from the 304, and replaces the headers on
the cached entry, in the process making the entry "fresh" again.

- mod_cache hands the cached data back to the browser.

Read http://www.ietf.org/rfc/rfc2616.txt section 13 (mainly) to see in
detail how this works.

Regards,
Graham
--

Re: Possible new cache architecture

Posted by Davi Arnaut <da...@haxent.com.br>.

On Tue, 2 May 2006 15:40:30 +0200 (SAST)
"Graham Leggett" <mi...@sharp.fm> wrote:

> On Tue, May 2, 2006 3:24 pm, Brian Akins said:
> 
> >> - the cache says "cool, will send my copy upstream. Oops, where has my
> >> data gone?".
> 
> > So, the cache says, okay must get content the old fashioned way (proxy,
> > filesystem, magic fairies, etc.).
> >
> > Where's the issue?
> 
> To rephrase it, a whole lot of extra code, which has to be written and
> debugged, has to say "oops, ok sorry backend about the If-None-Match, I
> thought I had it cached but I actually didn't, please can I have the full
> file?". Then the backend gives you a response with different headers to
> those you already delivered to the frontend. Oops.

There is no such scenario. I will simulate a request using the disk_cache
format:

. Incoming client requests URI /foo/bar/baz
. Request goes through mod_http_cache, which generates <hash> from the URI
. mod_http_cache asks mod_cache for the data associated with key: <hash>.header
. No data:
	. Fetch from upstream
. Data found:
	. If format #1 (contains a list of Vary headers):
		. Use each header name (from .header) with our request
		values (headers_in) to regenerate <hash> using HeaderName+
		HeaderValue+URI
		. Ask mod_cache for data with key: <hash>.header
			. No data:
				. Fetch from upstream
			. Data:
				. Serve data to client
	. If format #2
		. Serve data to client

Where is the difference?
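
The key regeneration step in the flow above could be sketched roughly as
follows. This is only an illustration, not code from mod_disk_cache: the
helper names base_key() and variant_key() are made up for the example, and
FNV-1a is used purely as a stand-in hash.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define FNV_OFFSET 14695981039346656037ULL

/* FNV-1a, used here only as a stand-in hash for illustration. */
static uint64_t fnv1a(uint64_t h, const char *s)
{
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Key for the first lookup: hash of the URI alone. */
static void base_key(char *buf, size_t len, const char *uri)
{
    assert(buf != NULL && uri != NULL);
    snprintf(buf, len, "%016llx.header",
             (unsigned long long)fnv1a(FNV_OFFSET, uri));
}

/* Key for the second lookup: HeaderName+HeaderValue for each Vary header
 * (values taken from the incoming request), followed by the URI. */
static void variant_key(char *buf, size_t len, const char *uri,
                        const char *const *names,
                        const char *const *values, size_t n)
{
    uint64_t h = FNV_OFFSET;

    assert(buf != NULL && uri != NULL);
    for (size_t i = 0; i < n; i++) {
        h = fnv1a(h, names[i]);
        /* A header missing from the request hashes as an empty value. */
        h = fnv1a(h, values[i] != NULL ? values[i] : "");
    }
    h = fnv1a(h, uri);
    snprintf(buf, len, "%016llx.header", (unsigned long long)h);
}
```

Both lookups go through the same generic get-by-key call; only the key
differs, which is the point of the comparison.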

> Keeping the code as simple as possible will keep your code bug free, which
> means less time debugging for you, and less time for end users trying to
> figure out what the cause is of their weird symptoms.

We are trying to make it as simple as possible by separating the storage
layer from the protocol layer.

--
Davi Arnaut

Re: Possible new cache architecture

Posted by Graham Leggett <mi...@sharp.fm>.

On Tue, May 2, 2006 3:24 pm, Brian Akins said:

>> - the cache says "cool, will send my copy upstream. Oops, where has my
>> data gone?".

> So, the cache says, okay must get content the old fashioned way (proxy,
> filesystem, magic fairies, etc.).
>
> Where's the issue?

To rephrase it, a whole lot of extra code, which has to be written and
debugged, has to say "oops, ok sorry backend about the If-None-Match, I
thought I had it cached but I actually didn't, please can I have the full
file?". Then the backend gives you a response with different headers to
those you already delivered to the frontend. Oops.

Keeping the code as simple as possible will keep your code bug free, which
means less time debugging for you, and less time for end users trying to
figure out what the cause is of their weird symptoms.

Regards,
Graham
--

Re: Possible new cache architecture

Posted by Brian Akins <br...@turner.com>.

Graham Leggett wrote:

> - the cache says "cool, will send my copy upstream. Oops, where has my 
> data gone?".
> 
>

So, the cache says, okay must get content the old fashioned way (proxy, 
filesystem, magic fairies, etc.).

Where's the issue?

-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies

Re: Possible new cache architecture

Posted by Graham Leggett <mi...@sharp.fm>.

Brian Akins wrote:

>> That's two hits to find whether something is cached.
> 
> You must have two hits if you support vary.

You need only one - bring up the original cached entry with the key, and 
then use cheap subkeys over a very limited data set to find both the 
variants and the header/data.

>> How are races prevented?
> 
> shouldn't be any.  something is in the cache or not.  if one "piece" of 
> an http "object" is not valid or in cache, the object is invalid. 
> Although other variants may be valid/in cache.

I can think of one race off the top of my head:

- the browser says "send me this URL".

- the cache has it cached, but it's stale, so it asks the backend 
"If-None-Match".

- the cache reaper comes along, says "oh, this is stale", and reaps the 
cached body (which is independent, remember?). The data is no longer 
cached even though the headers still exist.

- The backend says "304 Not Modified".

- the cache says "cool, will send my copy upstream. Oops, where has my 
data gone?".

The end user will probably experience this as "oh, the website had a 
glitch, let me try again", so it won't be reported as a bug.

Ok, so you tried to lock the body before going to the backend, but 
searching for and locking the body would have been an additional wasted 
cache hit if the backend answered with its own body. Not to mention 
having to write and debug code to do this.

Races need to be properly handled, and atomic cache operations will go a 
long way to prevent them.

Regards,
Graham
--

Re: mod_disk_cache read-while-caching patch

Posted by Niklas Edmundsson <ni...@acc.umu.se>.

On Tue, 2 May 2006, Niklas Edmundsson wrote:

<a lot of things>

>>> In any case the patch is more or less finished, independent testing
>>> and auditing haven't been done yet but I can submit a preliminary
>>> jumbo-patch if people are interested in having a look at it now.
>> 
>> Post it, people can take a look.
>
> OK. It's attached. It has only had mild testing using the worker mpm with 
> mmap enabled, it needs a bit more testing and auditing before trusting it too 
> hard.
>
> Note that this patch fixes a whole slew of other issues along the way, the 
> most notable ones being LFS on 32bit arch, don't eat all your 32bit 
> memory/address space when caching huge files, provide r->filename so %f in 
> LogFormat works, and other smaller issues.

Now it's a fair bit more complete, and tested quite a bit with the 
worker mpm at least. Most of the time has been spent trying to figure 
out what the apr*-api is doing. For example, when things go bad (say, a 
client hangs up a connection), memory seems to be freed before the 
cleanups are run; the segfault caused by that took quite some time to 
find.

This jumbo-patch is provided for those who want to see what the 
result is and see the general way I solved things. It still has some 
FIXMEs and debug messages. For those who would like to try it, note that 
the disk format is changed (one file instead of header and body in 
separate files) so you'll have to clean your httpcache to get to a 
known state. Regarding stability, we trust it enough to deploy it for 
production use.

In general, it solves LFS, read-while-caching and "thundering herd" 
and touches quite a lot of code to get there since I wanted to avoid 
the locking mess.

I'll start breaking this down into smaller patches, but given the 
recent cache-cleanup-effort I'd like to know how to do it.

* Should I provide patches against httpd-2.2.2 or trunk?
* Should I just attach them to bug #39380 or post them here first?

I suspect that the cache-cleanup-effort will make some of this 
obsolete, but since mod_disk_cache is a dark hole of no error handling 
and doing checks at weird places I suspect that at least those patches 
will prove useful.

/Nikke - looking forward to seeing how it survives the Ubuntu Dapper
          release ...
-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
  Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se      |     nikke@acc.umu.se
---------------------------------------------------------------------------
  I've seen the procedure hundreds of times. - Qwark
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Re: mod_disk_cache patch, preview edition (was: new cache arch)

Posted by Graham Leggett <mi...@sharp.fm>.

On Tue, May 2, 2006 3:50 pm, Niklas Edmundsson said:

> Are there partially cached files? If I request the last 200 bytes of a
> 4.3GB DVD image, the bucket brigade contains the complete file... The
> headers say ranges and all sorts of things but they don't match
> what's cached.

By "partially cached" I meant a file that was half cached, and other
processes/threads are serving content from that cache.

>> What may be useful is a cache header with some metadata in it giving the
>> total size and a "download failed" flag, which goes in front of the
>> headers. The metadata can also contain the offset of the body.
>
> I solved it with size in the body and a timeout mechanism, a "download
> failed" flag doesn't cope with segfaults.

True, but a timeout forces the end user to wait in cases where we already
know the backend is dead. This typically won't happen with a disk backend,
but it will happen with a mod_proxy backend (think connection reset by
peer).

> It's possible, but since I needed to hammer so hard at mod_disk_cache
> to get it in the shape I wanted it I set out to first get the whole
> thing working and then worry about breaking the patch into manageable
> pieces. For example, by doing it all-incremental there would have been
> a dozen or so disk format change-patches, and I really don't think you
> would have wanted that :)

We do want that if possible :) Small changes are easy to understand, and
thus in turn easy to get the three votes needed for inclusion into httpd
v2.2 from trunk.

Regards,
Graham
--

Re: mod_disk_cache patch, preview edition (was: new cache arch)

Posted by Niklas Edmundsson <ni...@acc.umu.se>.

On Tue, 2 May 2006, Graham Leggett wrote:

>> The need-size-issue goes for retrievals as well.
>
> If you are going to read from partially cached files, you need a "total
> size" field as well as a flag to say "give up, this attempt at caching
> failed"

Are there partially cached files? If I request the last 200 bytes of a 
4.3GB DVD image, the bucket brigade contains the complete file... The 
headers say ranges and all sorts of things but they don't match 
what's cached.

> What may be useful is a cache header with some metadata in it giving the
> total size and a "download failed" flag, which goes in front of the
> headers. The metadata can also contain the offset of the body.

I solved it with size in the body and a timeout mechanism, a "download 
failed" flag doesn't cope with segfaults.

>> OK. It's attached. It has only had mild testing using the worker mpm
>> with mmap enabled, it needs a bit more testing and auditing before
>> trusting it too hard.
>>
>> Note that this patch fixes a whole slew of other issues along the way,
>> the most notable ones being LFS on 32bit arch, don't eat all your
>> 32bit memory/address space when caching huge files, provide
>> r->filename so %f in LogFormat works, and other smaller issues.
>
> Is it possible to split the patch into separate fixes for each issue
> (where practical)? It makes it easier to digest.

It's possible, but since I needed to hammer so hard at mod_disk_cache 
to get it in the shape I wanted it I set out to first get the whole 
thing working and then worry about breaking the patch into manageable 
pieces. For example, by doing it all-incremental there would have been 
a dozen or so disk format change-patches, and I really don't think you 
would have wanted that :)

As said, this is a preliminary jumbo patch for those interested in how 
we tackled the various problems involved (or those who love to take 
bleeding edge code for a spin and watch it falling into pieces when 
hitting a weird corner case ;).

> Also the other fixes can be committed immediately/soon, depending on how
> simple they are, which will simplify the final patch.

Yup. I'll update bug#39380 when we feel that we have a good solution.

/Nikke
-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
  Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se      |     nikke@acc.umu.se
---------------------------------------------------------------------------
  To err is Human. To blame someone else is politics.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Re: mod_disk_cache patch, preview edition (was: new cache arch)

Posted by Graham Leggett <mi...@sharp.fm>.

On Tue, May 2, 2006 2:03 pm, Niklas Edmundsson said:

>> This is great, in doing this you've been solving a proxy bug that was
>> first reported in 1998 :).
>
> OK. Stuck in the "File under L for Later" pile? ;)

Er no, it was under the "redesign the entire code to fix it" class of
bugs. :)

The v2.0 mod_cache design had provision for solving this problem, but it
was never completed. The v1.3 mod_proxy/cache design needed a major
rewrite to fix; the effort was instead put into v2.0.

> Regarding partially cached files, it understands when caching a file
> has failed and so on.

All the cache has to worry about is invalidating all partially cached
files where an upstream error occurred (timeout, connection reset by peer,
whatever), the end goal being to never inadvertently cache a broken file.

> They are. It seek():s to an offset where the body is stored so
> headers can be updated as long as they don't grow too much.

Ok, makes sense.

> The need-size-issue goes for retrievals as well.

If you are going to read from partially cached files, you need a "total
size" field as well as a flag to say "give up, this attempt at caching
failed"

What may be useful is a cache header with some metadata in it giving the
total size and a "download failed" flag, which goes in front of the
headers. The metadata can also contain the offset of the body.

> OK. It's attached. It has only had mild testing using the worker mpm
> with mmap enabled, it needs a bit more testing and auditing before
> trusting it too hard.
>
> Note that this patch fixes a whole slew of other issues along the way,
> the most notable ones being LFS on 32bit arch, don't eat all your
> 32bit memory/address space when caching huge files, provide
> r->filename so %f in LogFormat works, and other smaller issues.

Is it possible to split the patch into separate fixes for each issue
(where practical)? It makes it easier to digest.

Also the other fixes can be committed immediately/soon, depending on how
simple they are, which will simplify the final patch.

Regards,
Graham
--

mod_disk_cache patch, preview edition (was: new cache arch)

Posted by Niklas Edmundsson <ni...@acc.umu.se>.

On Tue, 2 May 2006, Graham Leggett wrote:

>> I've been hacking on mod_disk_cache to make it:
>> * Only store one set of data when one uncached item is accessed
>>    simultaneously (currently all requests cache the file and the last
>>    cache process to finish "wins").
>> * Don't wait until the whole item is cached, reply while caching
>>    (currently it stalls).
>> * Don't block the requesting thread when requesting a large uncached
>>    item, cache in the background and reply while caching (currently it
>>    stalls).
>
> This is great, in doing this you've been solving a proxy bug that was
> first reported in 1998 :).

OK. Stuck in the "File under L for Later" pile? ;)

> The only things to be careful of is for Cache-Control: no-cache and
> friends to be handled gracefully (the partially cached file should be
> marked as "delete-me" so that the current request creates a new cache file
> / no cache file. Existing running downloads should be unaffected by
> this.), and for backend failures (either a timeout or a premature socket
> close) to cause the cache entry to be invalidated and deleted.

I haven't changed the handling of this, so any bugs in this regard 
shouldn't be my fault at least ;)

Regarding partially cached files, it understands when caching a file 
has failed and so on.

>> * More or less atomic operations, so caching headers and data in
>>    separate files gets very messy if you want to keep consistency.
>
> Keep in mind that HTTP/1.1 compliance requires that the headers be
> updatable without changing the body.

They are. It seek():s to an offset where the body is stored so 
headers can be updated as long as they don't grow too much.

>> * You can't use tempfiles since you want to be able to figure out
>>    where the data is to be able to reply while caching.
>> * You want to know the size of the data in order to tell when you're
>>    done (ie the current size of a file isn't necessarily the real size
>>    of the body since it might be caching while we're reading it).
>
> The cache already wants to know the size of the data so that it can decide
> whether it's prepared to try and cache the file in the first place, so in
> theory this should not be a problem.

The need-size-issue goes for retrievals as well.

You also have the "size unknown right now" issue, which this patch 
solves by writing a header with the size -1 and then updating it when 
the size is known.
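
To illustrate the idea (this is a simplified sketch, not the code in the
patch): a fixed-size record at the start of the cache file holds the size,
written as -1 first and overwritten in place later. The cache_file_hdr
struct and the hdr_* helper names are assumptions made up for the example,
and plain stdio is used instead of the APR file API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical fixed-size record at the start of the cache file;
 * the real on-disk format may differ. */
typedef struct {
    int64_t body_size;   /* -1 while the size is still unknown */
} cache_file_hdr;

/* Write the placeholder header before any body bytes are known. */
static int hdr_write_unknown(FILE *f)
{
    cache_file_hdr h = { -1 };

    assert(f != NULL);
    if (fseek(f, 0L, SEEK_SET) != 0)
        return -1;
    return fwrite(&h, sizeof h, 1, f) == 1 ? 0 : -1;
}

/* Overwrite the header in place once the final size is known. */
static int hdr_set_size(FILE *f, int64_t size)
{
    cache_file_hdr h = { size };

    assert(f != NULL && size >= 0);
    if (fseek(f, 0L, SEEK_SET) != 0)
        return -1;
    if (fwrite(&h, sizeof h, 1, f) != 1)
        return -1;
    return fflush(f) == 0 ? 0 : -1;
}

/* Read the recorded size; returns 0 on success, -1 on error. */
static int hdr_read_size(FILE *f, int64_t *size)
{
    cache_file_hdr h;

    assert(f != NULL && size != NULL);
    if (fseek(f, 0L, SEEK_SET) != 0)
        return -1;
    if (fread(&h, sizeof h, 1, f) != 1)
        return -1;
    *size = h.body_size;
    return 0;
}
```

Because the record has a fixed size, updating it never moves the body, so
readers that are already streaming from the file are not disturbed.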

>> In any case the patch is more or less finished, independent testing
>> and auditing haven't been done yet but I can submit a preliminary
>> jumbo-patch if people are interested in having a look at it now.
>
> Post it, people can take a look.

OK. It's attached. It has only had mild testing using the worker mpm 
with mmap enabled, it needs a bit more testing and auditing before 
trusting it too hard.

Note that this patch fixes a whole slew of other issues along the way, 
the most notable ones being LFS on 32-bit architectures, not eating all 
your 32-bit memory/address space when caching huge files, and providing 
r->filename so that %f in LogFormat works, plus other smaller issues.

/Nikke
-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
  Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se      |     nikke@acc.umu.se
---------------------------------------------------------------------------
  I am Zirofsky of Borg. I will reassimilate Alaska and Finland.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Re: Possible new cache architecture

Posted by Graham Leggett <mi...@sharp.fm>.

On Tue, May 2, 2006 11:22 am, Niklas Edmundsson said:

> I've been hacking on mod_disk_cache to make it:
> * Only store one set of data when one uncached item is accessed
>    simultaneously (currently all requests cache the file and the last
>    cache process to finish "wins").
> * Don't wait until the whole item is cached, reply while caching
>    (currently it stalls).
> * Don't block the requesting thread when requesting a large uncached
>    item, cache in the background and reply while caching (currently it
>    stalls).

This is great, in doing this you've been solving a proxy bug that was
first reported in 1998 :).

The only things to be careful of is for Cache-Control: no-cache and
friends to be handled gracefully (the partially cached file should be
marked as "delete-me" so that the current request creates a new cache file
/ no cache file. Existing running downloads should be unaffected by
this.), and for backend failures (either a timeout or a premature socket
close) to cause the cache entry to be invalidated and deleted.

> * More or less atomic operations, so caching headers and data in
>    separate files gets very messy if you want to keep consistency.

Keep in mind that HTTP/1.1 compliance requires that the headers be
updatable without changing the body.

> * You can't use tempfiles since you want to be able to figure out
>    where the data is to be able to reply while caching.
> * You want to know the size of the data in order to tell when you're
>    done (ie the current size of a file isn't necessarily the real size
>    of the body since it might be caching while we're reading it).

The cache already wants to know the size of the data so that it can decide
whether it's prepared to try and cache the file in the first place, so in
theory this should not be a problem.

> In any case the patch is more or less finished, independent testing
> and auditing haven't been done yet but I can submit a preliminary
> jumbo-patch if people are interested in having a look at it now.

Post it, people can take a look.

Regards,
Graham
--

Re: Possible new cache architecture

Posted by Davi Arnaut <da...@haxent.com.br>.

On Tue, 2 May 2006 11:22:31 +0200 (MEST)
Niklas Edmundsson <ni...@acc.umu.se> wrote:

> On Mon, 1 May 2006, Davi Arnaut wrote:
> 
> > More important, if we stick with the key/data concept it's possible to
> > implement the header/body relationship under single or multiple keys.
> 
> I've been hacking on mod_disk_cache to make it:
> * Only store one set of data when one uncached item is accessed
> >    simultaneously (currently all requests cache the file and the last
> >    cache process to finish "wins").
> * Don't wait until the whole item is cached, reply while caching
>    (currently it stalls).
> > * Don't block the requesting thread when requesting a large uncached
>    item, cache in the background and reply while caching (currently it
>    stalls).
> 
> This is mostly aimed at serving huge static files from a slow disk 
> backend (typically an NFS export from a server holding all the disk), 
> such as http://ftp.acc.umu.se/ and http://ftp.heanet.ie/ .
> 
> Doing this with the current mod_disk_cache disk layout was not 
> possible; doing the above without unnecessary locking means:
> 
> * More or less atomic operations, so caching headers and data in
>    separate files gets very messy if you want to keep consistency.
> * You can't use tempfiles since you want to be able to figure out
>    where the data is to be able to reply while caching.
> * You want to know the size of the data in order to tell when you're
>    done (ie the current size of a file isn't necessarily the real size
>    of the body since it might be caching while we're reading it).
> 
> In the light of our experiences, I really think that you want to have 
> a concept that allows you to keep the bond between header and data. 
> Yes, you can patch up a missing bond by requiring locking and stuff, but 
> I really prefer not having to lock cache files when doing read access. 
> When it comes to "make the common case fast" a lockless design is very 
> much preferred.

I will repeat once again: there is no locking involved, unless your format
of storing the header/data is really wrong. _The data format is up to
the module using it_, while the storage backend is a completely different
issue.

> However, if all those issues are sorted out in the layer above disk 
> cache then the above observations become more or less moot.

Yes, that's the point.

> In any case the patch is more or less finished, independent testing 
> and auditing haven't been done yet but I can submit a preliminary 
> jumbo-patch if people are interested in having a look at it now.

--
Davi Arnaut

Re: Possible new cache architecture

Posted by Niklas Edmundsson <ni...@acc.umu.se>.

On Mon, 1 May 2006, Davi Arnaut wrote:

> More important, if we stick with the key/data concept it's possible to
> implement the header/body relationship under single or multiple keys.

I've been hacking on mod_disk_cache to make it:
* Only store one set of data when one uncached item is accessed
   simultaneously (currently all requests cache the file and the last
   cache process to finish "wins").
* Don't wait until the whole item is cached, reply while caching
   (currently it stalls).
* Don't block the requesting thread when requesting a large uncached
   item, cache in the background and reply while caching (currently it
   stalls).

This is mostly aimed at serving huge static files from a slow disk 
backend (typically an NFS export from a server holding all the disk), 
such as http://ftp.acc.umu.se/ and http://ftp.heanet.ie/ .

Doing this with the current mod_disk_cache disk layout was not 
possible; doing the above without unnecessary locking means:

* More or less atomic operations, so caching headers and data in
   separate files gets very messy if you want to keep consistency.
* You can't use tempfiles since you want to be able to figure out
   where the data is to be able to reply while caching.
* You want to know the size of the data in order to tell when you're
   done (ie the current size of a file isn't necessarily the real size
   of the body since it might be caching while we're reading it).

In the light of our experiences, I really think that you want to have 
a concept that allows you to keep the bond between header and data. 
Yes, you can patch up a missing bond by requiring locking and stuff, but 
I really prefer not having to lock cache files when doing read access. 
When it comes to "make the common case fast" a lockless design is very 
much preferred.

However, if all those issues are sorted out in the layer above disk 
cache then the above observations become more or less moot.

In any case the patch is more or less finished, independent testing 
and auditing haven't been done yet but I can submit a preliminary 
jumbo-patch if people are interested in having a look at it now.

/Nikke
-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
  Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se      |     nikke@acc.umu.se
---------------------------------------------------------------------------
  Want to forget all your troubles? Wear tight shoes.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Re: Possible new cache architecture

Posted by Davi Arnaut <da...@haxent.com.br>.

On Mon, 01 May 2006 15:46:58 -0400
Brian Akins <br...@turner.com> wrote:

> Graham Leggett wrote:
> 
> > That's two hits to find whether something is cached.
> 
> You must have two hits if you support vary.
> 
> > How are races prevented?
> 
> shouldn't be any.  something is in the cache or not.  if one "piece" of 
> an http "object" is not valid or in cache, the object is invalid. 
> Although other variants may be valid/in cache.
> 

More important, if we stick with the key/data concept it's possible to
implement the header/body relationship under single or multiple keys.

I think Brian wants mod_cache to be only a layer (glue) between the
underlying providers and the cache users. Each set of problems is better
dealt with in its own layer. The storage layer (cache providers) will
only worry about storing the key/data pairs (and expiring?) while the
"protocol" layer will deal with the underlying concepts of each protocol
(mod_http_cache).

The current design leads to bloat. Just look at mem_cache and disk_cache:
both have their own duplicated quirks (serialize/unserialize, et cetera)
and need special handling of the headers and file format. Under the new
design this duplication will be gone; think of it as consolidating the
HTTP-specific part and generalizing the storage part.

--
Davi Arnaut

Re: Possible new cache architecture

Posted by Brian Akins <br...@turner.com>.

Graham Leggett wrote:

> That's two hits to find whether something is cached.

You must have two hits if you support vary.

> How are races prevented?

shouldn't be any.  something is in the cache or not.  if one "piece" of 
an http "object" is not valid or in cache, the object is invalid. 
Although other variants may be valid/in cache.
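
A small sketch of that rule, for illustration only: the lookup_fn callback,
the http_object struct and toy_lookup() are hypothetical stand-ins, not
existing httpd APIs. Note that it only covers the check at lookup time; it
says nothing about a piece disappearing after the check.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical generic-cache lookup: returns the blob for key, or NULL
 * if the entry is missing or expired. */
typedef const char *(*lookup_fn)(const char *key);

typedef struct {
    const char *headers;
    const char *data;
} http_object;

/* Toy store for illustration: "key" has lost its body, "key2" is whole. */
static const char *toy_lookup(const char *key)
{
    if (strcmp(key, "key.header") == 0)  return "Content-Type: text/html";
    if (strcmp(key, "key2.header") == 0) return "Content-Type: text/plain";
    if (strcmp(key, "key2.data") == 0)   return "hello";
    return NULL;
}

/* Returns 1 and fills obj only when both pieces are present; otherwise
 * returns 0 and the caller treats the request as a plain cache miss. */
static int http_object_get(lookup_fn lookup, const char *header_key,
                           const char *data_key, http_object *obj)
{
    assert(lookup != NULL && obj != NULL);
    const char *h = lookup(header_key);
    const char *d = (h != NULL) ? lookup(data_key) : NULL;

    if (h == NULL || d == NULL) {
        obj->headers = NULL;
        obj->data = NULL;
        return 0;
    }
    obj->headers = h;
    obj->data = d;
    return 1;
}
```

With the toy store, asking for key.header/key.data is a miss even though
the headers are still there, while key2 is a hit.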

-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies

Re: Possible new cache architecture

Posted by Graham Leggett <mi...@sharp.fm>.

Brian Akins wrote:

> Nope.  Look at the way the current http cache works. An http "object," 
> headers and data, is only valid if both headers and data are valid.

That's two hits to find whether something is cached.

How are races prevented?

Regards,
Graham
--

Re: Possible new cache architecture

Posted by Brian Akins <br...@turner.com>.

Davi Arnaut wrote:
> This way it would be possible for one cache to act as a cache of another
> cache provider, mod_mem_cache would work as a small/fast MRU cache for
> mod_disk_cache.

Slightly off subject, but in my testing, mod_disk_cache is much faster 
than mod_mem_cache.  Thanks to sendfile!

I was thinking about scenarios where each cache had its own local cache 
(disk, mem, whatever) with memcache behind it.  That way each "object" 
only has to be generated once for the entire "farm."  This would be an 
easy way to have a distributed cache.

Also, Squid-style HTCP (or ICP) could be a fallback for the local 
cache as well without mucking up all the proxy and cache code.

-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies

Re: Possible new cache architecture

Posted by Davi Arnaut <da...@haxent.com.br>.

On Mon, 01 May 2006 09:02:31 -0400
Brian Akins <br...@turner.com> wrote:

> Here is a scenario.  We will assume a cache "hit."

I think the usage scenario is clear. Moving on, I would like to be able to stack
up the cache providers (like the apache filter chain). Basically, mod_cache
will expose the functions:

	add(key, value, expiration, flag)
	get(key)
	remove(key)

mod_cache will then pass the request (add/get or remove) down the chain,
similar to the Apache filter chain, i.e.:

apr_status_t mem_cache_get_filter(ap_cache_filter_t *f,
                                  apr_bucket_brigade *bb, ...);

apr_status_t disk_cache_get_filter(ap_cache_filter_t *f,
                                   apr_bucket_brigade *bb, ...);

This way it would be possible for one cache to act as a cache of another
cache provider, mod_mem_cache would work as a small/fast MRU cache for
mod_disk_cache.
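
To make the stacking idea concrete, here is a toy sketch of a get() that
walks a chain of providers and copies a hit from a lower tier into the
tiers above it. The cache_provider struct and the provider_* / chain_get
names are invented for this example; this is not the ap_cache_filter_t
interface proposed above, and it ignores expiration, flags and eviction.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SLOTS 8

/* Toy key/value store standing in for one provider (mem, disk, ...).
 * It stores pointers only; the caller keeps the strings alive. */
typedef struct cache_provider {
    const char *keys[SLOTS];
    const char *values[SLOTS];
    struct cache_provider *next;   /* next provider down the chain */
} cache_provider;

static const char *provider_get(const cache_provider *p, const char *key)
{
    for (int i = 0; i < SLOTS; i++)
        if (p->keys[i] != NULL && strcmp(p->keys[i], key) == 0)
            return p->values[i];
    return NULL;
}

/* Store in the first free slot; returns 0 on success, -1 if full. */
static int provider_add(cache_provider *p, const char *key, const char *value)
{
    for (int i = 0; i < SLOTS; i++) {
        if (p->keys[i] == NULL) {
            p->keys[i] = key;
            p->values[i] = value;
            return 0;
        }
    }
    return -1;
}

/* Walk the chain; on a hit in a lower tier, copy the entry upwards. */
static const char *chain_get(cache_provider *head, const char *key)
{
    assert(head != NULL && key != NULL);
    for (cache_provider *p = head; p != NULL; p = p->next) {
        const char *v = provider_get(p, key);
        if (v != NULL) {
            for (cache_provider *up = head; up != p; up = up->next)
                (void)provider_add(up, key, v);   /* best-effort promotion */
            return v;
        }
    }
    return NULL;
}
```

With mem.next pointing at disk, the first chain_get() for a key stored only
on disk is answered by the disk tier and the second one by the memory tier.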

--
Davi Arnaut

Re: Possible new cache architecture

Posted by Brian Akins <br...@turner.com>.

Here is a scenario.  We will assume a cache "hit."

Client asks for http://domain/uri.html?args

mod_http_cache generates a key: http-domain-uri.html-args-header

asks mod_cache for value with this key.

mod_cache fetches the value, looks at the expire time, it's good, and 
returns the "blob"

mod_http_cache examines the blob; it's vary information on Accept-Encoding.

mod_http_cache generates a new key: http-domain-uri.html-args-header-gzip 
(value from client)

asks mod_cache for value with this key.

mod_cache fetches the value, looks at the expire time, it's good, and 
returns the "blob"

mod_http_cache examines the blob; it's a normal header blob. The request 
does not "meet conditions", so we need to get the data.

mod_http_cache generates a new key: http-domain-uri.html-args-data-gzip 
(value from client)

asks mod_cache for value with this key.

mod_cache fetches the value, looks at the expire time, it's good, and 
returns the "blob"


mod_http_cache returns headers and data to client.


Notice there is a pattern to this...
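
The pattern is that every lookup is just another key derived from the same
base. A minimal sketch of that derivation, assuming a made-up helper
http_cache_key() (not an existing httpd function) and the dash-separated
key layout from the example above:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build "<base>-<part>" or "<base>-<part>-<variant>".
 *   base     e.g. "http-domain-uri.html-args"
 *   part     "header" or "data"
 *   variant  value taken from the client request (e.g. "gzip"), or NULL
 * Returns 0 on success, -1 if the key does not fit in buf. */
static int http_cache_key(char *buf, size_t len, const char *base,
                          const char *part, const char *variant)
{
    int n;

    assert(buf != NULL && base != NULL && part != NULL);
    if (variant != NULL && variant[0] != '\0')
        n = snprintf(buf, len, "%s-%s-%s", base, part, variant);
    else
        n = snprintf(buf, len, "%s-%s", base, part);
    /* snprintf reports the length it wanted; treat truncation as an error. */
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Each of the three lookups in the scenario then becomes one call to this
helper followed by one get on the generic cache.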
-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies