Posted to dev@httpd.apache.org by Issac Goldstand <ma...@beamartyr.net> on 2006/09/13 21:29:54 UTC

mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Hi all,
  I've been hacking at mod_cache a bit, and was surprised to find that
part of the decision to serve previously cached content or not was being
made by the backend provider and not mod_cache; specifically, the
expiration date of the content seems to be checked by mod_disk_cache (as
part of open_entity), and if the provider check fails, mod_cache doesn't
even know about the entity (and therefore, in the case of a caching
proxy,  can't treat it as a possibly stale entity upon which it can just
do a conditional GET and possibly get a 304, rather than requiring
mod_proxy to rerequest the entire entity again).

When I originally started looking at the family of cache modules, I
assumed that all of the decision-making logic would be in mod_cache,
while the mod_xxx_cache providers would be "dumb" file-stores (at least,
as far as mod_cache is concerned).  Is this not the case?
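
A minimal sketch of the split described above, assuming a provider that only stores and returns entries while the cache layer decides what to do with them. The names Entry, DumbStore and decide are hypothetical and are not part of mod_cache; only open_entity is borrowed from the provider hook mentioned above.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    body: bytes
    expires: float                # absolute expiry time, seconds since the epoch
    etag: Optional[str] = None    # validator usable for a conditional GET


class DumbStore:
    """Hypothetical provider: returns whatever it has and never checks expiry."""

    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}

    def put(self, key: str, entry: Entry) -> None:
        self._entries[key] = entry

    def open_entity(self, key: str) -> Optional[Entry]:
        return self._entries.get(key)


def decide(store: DumbStore, key: str, now: float) -> str:
    """Freshness policy lives here, in the cache layer, not in the store."""
    entry = store.open_entity(key)
    if entry is None:
        return "fetch"            # nothing cached: full request to the origin
    if now < entry.expires:
        return "serve"            # fresh: serve straight from the cache
    if entry.etag is not None:
        return "revalidate"       # stale: conditional GET, hoping for a 304
    return "fetch"                # stale and no validator: full request


store = DumbStore()
store.put("/a", Entry(b"hello", expires=100.0, etag='"v1"'))
fresh = decide(store, "/a", now=50.0)
stale = decide(store, "/a", now=150.0)
missing = decide(store, "/missing", now=50.0)
```

Because the store hands back the expired entry instead of hiding it, the cache layer still has the validator it needs for the conditional request.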

If it is, would patches be acceptable if I have the time to try to
rectify the situation (at least somewhat)?

  Issac

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Davi Arnaut <da...@haxent.com.br>.
On 14/09/2006, at 05:08, Issac Goldstand wrote:

> This looks familiar.  I seem to remembering seeing patches for this a
> few months back.   Were they not committed to trunk?  If not, is there
> any reason why not?  I'd hate to spend serious time making  
> modifications
> only to have to redo the work when this (pretty major) patchset gets
> committed...

Probably because I have not yet submitted it for inclusion and it's  
not finished.
A new cache-dev branch would be really nice - hint hint :)

--
Davi Arnaut

>
>
> Davi Arnaut wrote:
>>
>> On 13/09/2006, at 16:29, Issac Goldstand wrote:
>>
>>> Hi all,
>>>   I've been hacking at mod_cache a bit, and was surprised to find  
>>> that
>>> part of the decision to serve previously cached content or not  
>>> was being
>>> made by the backend provider and not mod_cache; specifically, the
>>> expiration date of the content seems to be checked by  
>>> mod_disk_cache (as
>>> part of open_entity), and if the provider check fails, mod_cache  
>>> doesn't
>>> even know about the entity (and therefore, in the case of a caching
>>> proxy,  can't treat it as a possibly stale entity upon which it  
>>> can just
>>> do a conditional GET and possibly get a 304, rather than requiring
>>> mod_proxy to rerequest the entire entity again).
>>>
>>> When I originally started looking at the family of cache modules, I
>>> assumed that all of the decision-making logic would be in mod_cache,
>>> while the mod_xxx_cache providers would be "dumb" file-stores (at  
>>> least,
>>> as far as mod_cache is concerned).  Is this not the case?
>>
>> I'm working on this. You may want to check my proposal at
>> http://verdesmares.com/Apache/proposal.txt
>>
>>>
>>> If it is, would patches be acceptable if I have the time to try to
>>> rectify the situation (at least somewhat)?
>>
>> http://verdesmares.com/Apache/patches/022.patch
>>
>> I'm still working on it, things may change radically.
>>
>> -- 
>> Davi Arnaut
>>


Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Issac Goldstand <ma...@beamartyr.net>.
This looks familiar.  I seem to remember seeing patches for this a
few months back.   Were they not committed to trunk?  If not, is there
any reason why not?  I'd hate to spend serious time making modifications
only to have to redo the work when this (pretty major) patchset gets
committed...

  Issac

Davi Arnaut wrote:
>
> On 13/09/2006, at 16:29, Issac Goldstand wrote:
>
>> Hi all,
>>   I've been hacking at mod_cache a bit, and was surprised to find that
>> part of the decision to serve previously cached content or not was being
>> made by the backend provider and not mod_cache; specifically, the
>> expiration date of the content seems to be checked by mod_disk_cache (as
>> part of open_entity), and if the provider check fails, mod_cache doesn't
>> even know about the entity (and therefore, in the case of a caching
>> proxy,  can't treat it as a possibly stale entity upon which it can just
>> do a conditional GET and possibly get a 304, rather than requiring
>> mod_proxy to rerequest the entire entity again).
>>
>> When I originally started looking at the family of cache modules, I
>> assumed that all of the decision-making logic would be in mod_cache,
>> while the mod_xxx_cache providers would be "dumb" file-stores (at least,
>> as far as mod_cache is concerned).  Is this not the case?
>
> I'm working on this. You may want to check my proposal at
> http://verdesmares.com/Apache/proposal.txt
>
>>
>> If it is, would patches be acceptable if I have the time to try to
>> rectify the situation (at least somewhat)?
>
> http://verdesmares.com/Apache/patches/022.patch
>
> I'm still working on it, things may change radically.
>
> -- 
> Davi Arnaut
>

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Davi Arnaut <da...@haxent.com.br>.
On 14/09/2006, at 09:21, Davi Arnaut wrote:

>
> On 14/09/2006, at 09:06, Niklas Edmundsson wrote:
>
>> On Thu, 14 Sep 2006, Davi Arnaut wrote:
>>
>>>
>>> On 14/09/2006, at 04:24, Niklas Edmundsson wrote:
>>>
>>>> On Wed, 13 Sep 2006, Davi Arnaut wrote:
>>>>> I'm working on this. You may want to check my proposal at  
>>>>> http://verdesmares.com/Apache/proposal.txt
>>>> Will it be possible to do away with "one file for headers and  
>>>> one file for body" in mod_disk_cache with this scheme?
>>>
>>> http://verdesmares.com/Apache/patches/016.patch
>>
>> OK. You seem to dump the body right after the headers though, so  
>> you won't be able to do header rewrites.
>
> Could you kindly point me to the cache code that rewrites only the  
> headers ?

Oops, I spoke too much too soon :)

>
>>
>> Also, it's rather unneccessary to call the files ".cache" if there  
>> are only one type of files ;)
>
> That's convenience, there may be other type of files on the same  
> cache directory that are created by other tools.
>
> --
> Davi Arnaut
>


Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Brian Akins <br...@turner.com>.
Niklas Edmundsson wrote:

> If I remember correctly the code in 2.2.3 only does whole-file 
> revalidation,

No, it can have a stale handle that it "makes fresh" if it gets a 304.

-- 
Brian Akins
Chief Operations Engineer
Turner Digital Media Technologies

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Niklas Edmundsson <ni...@acc.umu.se>.
On Thu, 14 Sep 2006, Davi Arnaut wrote:

>>>>> I'm working on this. You may want to check my proposal at 
>>>>> http://verdesmares.com/Apache/proposal.txt
>>>> Will it be possible to do away with "one file for headers and one file 
>>>> for body" in mod_disk_cache with this scheme?
>>> 
>>> http://verdesmares.com/Apache/patches/016.patch
>> 
>> OK. You seem to dump the body right after the headers though, so you won't 
>> be able to do header rewrites.
>
> Could you kindly point me to the cache code that rewrites only the headers ?

If I remember correctly the code in 2.2.3 only does whole-file 
revalidation, the next logical step (that our patch does) is to make 
it understand that if the source file hasn't changed you don't have to 
copy the whole file since it's enough to just update the headers.

Our patch does this, because it's needed to get decent performance 
when juggling dvd images (yes, recaching a 4GB file is rather 
expensive).
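
One way to allow a header-only update when headers and body share a single file is to reserve a fixed-size region for the headers. The sketch below assumes exactly that; the 4096-byte region, the JSON encoding and all function names are hypothetical and are not taken from either patch discussed in this thread.

```python
import json
import os
import tempfile

HEADER_REGION = 4096  # hypothetical fixed space reserved for headers


def pack_headers(headers):
    raw = json.dumps(headers).encode("utf-8")
    if len(raw) > HEADER_REGION:
        raise ValueError("headers do not fit in the reserved region")
    return raw.ljust(HEADER_REGION, b" ")   # pad so the body offset never moves


def write_entry(path, headers, body):
    with open(path, "wb") as f:
        f.write(pack_headers(headers))
        f.write(body)


def rewrite_headers(path, headers):
    """Overwrite only the header region; the body bytes are not touched."""
    with open(path, "r+b") as f:
        f.write(pack_headers(headers))


def read_entry(path):
    with open(path, "rb") as f:
        headers = json.loads(f.read(HEADER_REGION).decode("utf-8"))
        return headers, f.read()


path = os.path.join(tempfile.mkdtemp(), "entry.cache")
write_entry(path, {"Expires": "old"}, b"body-bytes")
rewrite_headers(path, {"Expires": "new"})   # e.g. after a successful revalidation
headers, body = read_entry(path)
```

The in-place overwrite is not atomic, so a real implementation would still need some way to keep readers from seeing a half-written header region.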

There are a couple of trivial improvements like this that need to be 
done in mod_disk_cache, and they depend on the underlying disk storage 
layer being "done right". However, given the current state of mod_disk_cache 
almost everything is an improvement...

>> Also, it's rather unneccessary to call the files ".cache" if there are only 
>> one type of files ;)
>
> That's convenience, there may be other type of files on the same cache 
> directory that are created by other tools.

That seems silly to me, the cache directory structure should be 
strictly private to the cache.

/Nikke
-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
  Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se      |     nikke@acc.umu.se
---------------------------------------------------------------------------
  "You have learned much, young one." - Vader
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Davi Arnaut <da...@haxent.com.br>.
On 14/09/2006, at 09:06, Niklas Edmundsson wrote:

> On Thu, 14 Sep 2006, Davi Arnaut wrote:
>
>>
>> On 14/09/2006, at 04:24, Niklas Edmundsson wrote:
>>
>>> On Wed, 13 Sep 2006, Davi Arnaut wrote:
>>>> I'm working on this. You may want to check my proposal at
>>>> http://verdesmares.com/Apache/proposal.txt
>>> Will it be possible to do away with "one file for headers and one  
>>> file for body" in mod_disk_cache with this scheme?
>>
>> http://verdesmares.com/Apache/patches/016.patch
>
> OK. You seem to dump the body right after the headers though, so  
> you won't be able to do header rewrites.

Could you kindly point me to the cache code that rewrites only the  
headers ?

>
> Also, it's rather unneccessary to call the files ".cache" if there  
> are only one type of files ;)

That's convenience, there may be other type of files on the same  
cache directory that are created by other tools.

--
Davi Arnaut


Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Niklas Edmundsson <ni...@acc.umu.se>.
On Thu, 14 Sep 2006, Davi Arnaut wrote:

>
> On 14/09/2006, at 04:24, Niklas Edmundsson wrote:
>
>> On Wed, 13 Sep 2006, Davi Arnaut wrote:
>> 
>>> I'm working on this. You may want to check my proposal at 
>>> http://verdesmares.com/Apache/proposal.txt
>> 
>> Will it be possible to do away with "one file for headers and one file for 
>> body" in mod_disk_cache with this scheme?
>
> http://verdesmares.com/Apache/patches/016.patch

OK. You seem to dump the body right after the headers though, so you 
won't be able to do header rewrites.

Also, it's rather unnecessary to call the files ".cache" if there is 
only one type of file ;)

/Nikke
-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
  Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se      |     nikke@acc.umu.se
---------------------------------------------------------------------------
  "You have learned much, young one." - Vader
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Davi Arnaut <da...@haxent.com.br>.
On 14/09/2006, at 04:24, Niklas Edmundsson wrote:

> On Wed, 13 Sep 2006, Davi Arnaut wrote:
>
>> I'm working on this. You may want to check my proposal at
>> http://verdesmares.com/Apache/proposal.txt
>
> Will it be possible to do away with "one file for headers and one  
> file for body" in mod_disk_cache with this scheme?

http://verdesmares.com/Apache/patches/016.patch

>
> The thing is that I've been pounding seriously at mod_disk_cache to  
> make it able to sustain rather heavy load on not-so-heavy  
> equipment, and part of that effort was to wrap headers and body  
> into one file for mainly the following purposes:
>
> * Less files, less open():s (small gain)
> * Way much easier to purge old entries from the cache (huge gain).
>   Simply list all files in cache, sort by atime and remove the oldest.
>   The old way by using htcacheclean took ages and had less useful
>   removal criteria.
> * No synchronisation issues between the header file and body file,
>   unlink one and it's gone.
>
> That's only one of many changes made, but I found it to be crucial  
> to be able to have an architecture that's consistent without  
> relying on locks. This made it rather easy to implement stuff like  
> serving files that are currently being cached from cache, reusing  
> expired cached files if the originating file is found to be  
> unmodified, and so on.
>
> But the largest gain is still the cache cleaning process.
>
> The stuff is used in production and seems stable, however I haven't  
> had any response to the first (trivial) patch sent so I don't know  
> if there's any interest in this.
>
> /Nikke
> -- 
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>  Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se      |     nikke@acc.umu.se
> ---------------------------------------------------------------------------
>  Does the Little Mermaid wear an algebra?
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
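
The purge step quoted above (list all files, sort by atime, remove the oldest) can be sketched as follows. This assumes one file per cached entity in a single flat directory; a real cache directory is nested, and the function name and the keep parameter are hypothetical.

```python
import os
import tempfile


def purge_oldest(cache_dir, keep):
    """Remove the least recently accessed files until only `keep` remain."""
    entries = []
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if os.path.isfile(path):
            entries.append((os.stat(path).st_atime, path))
    entries.sort()                           # oldest access time first
    doomed = entries[:max(0, len(entries) - keep)]
    for _, path in doomed:
        os.unlink(path)                      # one unlink removes headers and body
    return [path for _, path in doomed]


cache_dir = tempfile.mkdtemp()
for i, name in enumerate(["a", "b", "c"]):
    path = os.path.join(cache_dir, name)
    with open(path, "wb") as f:
        f.write(b"data")
    os.utime(path, (1000 + i, 1000 + i))     # set atime and mtime explicitly
removed = purge_oldest(cache_dir, keep=1)
```

With headers and body in the same file there is nothing to keep in sync during the purge, which is the point made in the quoted list.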


Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Graham Leggett <mi...@sharp.fm>.
On Thu, September 21, 2006 11:05 am, Issac Goldstand wrote:

> Based on that, it seems to me that the sensible thing to do would be to
> update the header file to include trailers after the response is
> complete (and send them as-is as trailers to the initial client).  If
> we're already doing that, then it would probably also make sense to
> calculate the entity-length to update the headers afterwards.

This makes sense - once completely cached, all cached entities should have
a content length header added, even if it's done after the entry is
finished being cached.

mod_proxy would need to cooperate by passing the trailers somehow up the
filter stack if a trailer is present, cache_save would then add the
trailer to the existing headers.

In theory, all mod_proxy needs to do is add the trailer to the headers
list when one is received. Once mod_cache has finished caching an entity,
mod_cache could then check and see if the length of the header list has
changed from when the request started - if so, it means a trailer was
present and the headers must be updated.
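
The check described above can be sketched as follows, assuming the producer appends trailers to the same header table while the body streams. The function names and the trailer value are hypothetical; this is not how cache_save is written.

```python
def save_with_trailers(headers, body_chunks):
    """Cache a body, then rewrite headers if the header list changed size."""
    stored_headers = dict(headers)       # first header write, before the body
    count_at_start = len(headers)
    body = b""
    for chunk in body_chunks:            # the producer may add trailers to
        body += chunk                    # `headers` while the body streams
    if len(headers) != count_at_start:   # a trailer arrived: update headers
        stored_headers = dict(headers)
    stored_headers["Content-Length"] = str(len(body))
    return stored_headers, body


def proxied_body(headers):
    """Stand-in for mod_proxy: yields the body, then records a trailer."""
    yield b"hello "
    yield b"world"
    headers["Content-MD5"] = "hypothetical-trailer-value"


headers = {"Content-Type": "text/plain"}
stored, body = save_with_trailers(headers, proxied_body(headers))
```

Comparing only the number of headers would miss a trailer that replaces an existing header, so a real implementation might need a stricter comparison.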

> However,
> I'm not even sure as to where such things should be implemented, in
> mod_cache, mod_proxy(_http), or http filters, or somewhere else entirely?

mod_proxy definitely needs to be involved, by not ignoring trailers.
mod_cache can however cache anything in the server (CGI, anything), so it
cannot be assumed that mod_proxy will always be involved when caching.

Regards,
Graham
--



Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Issac Goldstand <ma...@beamartyr.net>.

Issac Goldstand wrote:
> In any case, if we're proxying for an HTTP/1.0 client using HTTP/1.1
> (too tired to check if mod_proxy preserves HTTP version - but will try
> to check tomorrow if no one beats me to it), or even serving cached
> content to a 1.0 client originally received by a 1.1 request, we'd have
> to do all this even now, wouldn't we?
>   
As promised, I looked into this.  mod_proxy de-chunks the incoming 
response, mod_cache caches it un-chunked, without the content-length or 
transfer-encoding headers, and the content-length output filter decides 
what to do with it.  Trailers were a bit tricky since I'm not quite sure 
that I rolled them properly, but they were stripped from the response 
when apache de-chunked the proxy requests.  However, they were also 
neither cached nor even forwarded to the client.

While looking into this, I also stumbled across a paper[1]  summarizing 
some key changes between versions 1.0 and 1.1 of the protocol, which 
pointed out some useful specific examples[2] about trailers.

Based on that, it seems to me that the sensible thing to do would be to 
update the header file to include trailers after the response is 
complete (and send them as-is as trailers to the initial client).  If 
we're already doing that, then it would probably also make sense to 
calculate the entity-length to update the headers afterwards.  However, 
I'm not even sure as to where such things should be implemented, in 
mod_cache, mod_proxy(_http), or http filters, or somewhere else entirely?

[1] http://www8.org/w8-papers/5c-protocols/key/key.html
[2] 
http://www8.org/w8-papers/5c-protocols/key/key.html#SECTION00062000000000000000

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Henrik Nordstrom <hn...@squid-cache.org>.
On Thu, 2006-09-21 at 00:19 +0300, Issac Goldstand wrote:

> The only really relevant line I saw (in a quick 15 minute review) is RFC
> 2616-3.6 (regarding transfer-encodings):
>    "Transfer-coding values are used to indicate an encoding
>    transformation that has been, can be, or may need to be applied to an
>    entity-body in order to ensure "safe transport" through the network.
>    This differs from a content coding in that the transfer-coding is a
>    property of the message, not of the original entity."
> 
> Based on that, it seems to be ok.  However, we'd have to remove strong
> ETags as a side-effect if it was done (since strong ETags change when
> entity headers change).

Hmm... transfer-encoding is a function of the transport alone, not the
entity. Don't mix these up. The entity is unaltered by
transfer-encoding, it's only how it's transferred over the transport
(i.e. TCP) which is altered. This also means that transfer-encoding is
hop-by-hop. In applications layered along the intentions of the RFC then
a cache (any level, browser or proxy) would never see any transfer
encoding as this should have been decoded by the receiving protocol
handler, only identity encoding should be seen.

This is different from Content-Encoding which does alter the entity as
such. Modifications of the Content-Encoding must also account for ETag:s
as no two entity variants of the same URL may carry the same strong
ETag.

> And move trailers into headers (another reason
> to rewrite the headers file at the end).  And probably other things
> which I'm not think of...

That's always ok. The division of main and trailer headers is also mainly
a transport thing. Only available with chunked encoding btw, as it's the
only transfer mechanism which allows for a trailer. The specs allow you
to drop any trailer headers if hard to deal with or to merge them with
the main header if you can.

In direct chunked->chunked proxy transfer you should proxy the trailer
as well. In chunked->identity transfer (i.e. HTTP/1.1 response ->
HTTP/1.0 client) the trailer is silently dropped as there is no means to
transfer the trailer in HTTP/1.0, and you can't rewind a TCP stream to
add data earlier...
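
For the cache case, where the whole response is available before the headers are stored, merging the trailer into the main header can be sketched as follows. This is a simplified decoder written for illustration, assuming a complete and well-formed chunked body held in memory; the function names and the X-Checksum trailer are hypothetical.

```python
def decode_chunked(raw):
    """Decode a chunked message body; return (body, trailer_headers)."""
    body = b""
    pos = 0
    while True:
        line_end = raw.index(b"\r\n", pos)
        size_field = raw[pos:line_end].split(b";", 1)[0]   # drop chunk extensions
        size = int(size_field.strip(), 16)
        pos = line_end + 2
        if size == 0:
            break                        # last chunk: the trailer section follows
        body += raw[pos:pos + size]
        pos += size + 2                  # skip the chunk data and its CRLF
    trailers = {}
    while True:
        line_end = raw.index(b"\r\n", pos)
        line = raw[pos:line_end]
        pos = line_end + 2
        if not line:
            break                        # blank line ends the trailer section
        name, _, value = line.partition(b":")
        trailers[name.strip().decode("ascii")] = value.strip().decode("ascii")
    return body, trailers


def to_identity(headers, raw):
    """Re-frame as identity: merge trailers into headers and add a length."""
    body, trailers = decode_chunked(raw)
    merged = {k: v for k, v in headers.items()
              if k.lower() != "transfer-encoding"}   # hop-by-hop, not stored
    merged.update(trailers)
    merged["Content-Length"] = str(len(body))
    return merged, body


raw = b"5\r\nhello\r\n6\r\n world\r\n0\r\nX-Checksum: abc\r\n\r\n"
merged, body = to_identity(
    {"Content-Type": "text/plain", "Transfer-Encoding": "chunked"}, raw)
```

The entity bytes come out identical; only the framing and the placement of the trailer field change, which matches the point that transfer-encoding is a property of the transport rather than of the entity.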

Regards
Henrik

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Issac Goldstand <ma...@beamartyr.net>.

>    "This differs from a content coding in that the transfer-coding is a
>    property of the message, not of the original entity."
> 
> Based on that, it seems to be ok.  However, we'd have to remove strong
> ETags as a side-effect if it was done (since strong ETags change when
> entity headers change).  

I'm too tired - I'm not even reading my own points.  Based on the above
"property of the message, not property of the entity", this may not be
true.  I'm going to go to bed now, before I make a bigger fool of myself
than I've already managed to this evening.

  Issac

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Issac Goldstand <ma...@beamartyr.net>.

Ruediger Pluem wrote:
> 
> On 09/20/2006 09:59 PM, Issac Goldstand wrote:
>> Ruediger Pluem wrote:
>>> First of all I guess you mean: BEFORE the CACHE_SAVE filter :-).
>>> Yes, there is a reason why we cannot do this: This would create a possible DoS, because we have to
>>> suck in the whole response first before actually forwarding it. Also this would not work with flush
>>> buckets.
>>
>> Well, yes.  I stuck de-chunk in there as an afterthought (the original
>> check just being a sanity check on the reported entity size to take care
>> of that 0 length case).
>>
>> Why the DoS, though?  No reason to suck everything in first - my thought
> 
> I thought you wanted to use this as a prevention for the possible DoS that is prevented by
> CacheMaxFileSize.
> 
>> was to update the headers a second time after the body was written.
>> Only thing we need to hang on to is byte count and status (eg, headers
> 
> I am not sure if its allowed for the cache to change the transport encoding. If yes
> I guess this makes sense.
> 

I don't think it says one way or another straight-out.

The only really relevant line I saw (in a quick 15 minute review) is RFC
2616-3.6 (regarding transfer-encodings):
   "Transfer-coding values are used to indicate an encoding
   transformation that has been, can be, or may need to be applied to an
   entity-body in order to ensure "safe transport" through the network.
   This differs from a content coding in that the transfer-coding is a
   property of the message, not of the original entity."

Based on that, it seems to be ok.  However, we'd have to remove strong
ETags as a side-effect if it was done (since strong ETags change when
entity headers change).  And move trailers into headers (another reason
to rewrite the headers file at the end).  And probably other things
which I'm not thinking of...

In any case, if we're proxying for an HTTP/1.0 client using HTTP/1.1
(too tired to check if mod_proxy preserves HTTP version - but will try
to check tomorrow if no one beats me to it), or even serving cached
content to a 1.0 client originally received by a 1.1 request, we'd have
to do all this even now, wouldn't we?

  Issac

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Ruediger Pluem <rp...@apache.org>.

On 09/20/2006 09:59 PM, Issac Goldstand wrote:
> 
> Ruediger Pluem wrote:
>>
>>First of all I guess you mean: BEFORE the CACHE_SAVE filter :-).
>>Yes, there is a reason why we cannot do this: This would create a possible DoS, because we have to
>>suck in the whole response first before actually forwarding it. Also this would not work with flush
>>buckets.
> 
> 
> Well, yes.  I stuck de-chunk in there as an afterthought (the original
> check just being a sanity check on the reported entity size to take care
> of that 0 length case).
> 
> Why the DoS, though?  No reason to suck everything in first - my thought

I thought you wanted to use this as a prevention for the possible DoS that is prevented by
CacheMaxFileSize.

> was to update the headers a second time after the body was written.
> Only thing we need to hang on to is byte count and status (eg, headers

I am not sure if it's allowed for the cache to change the transfer encoding. If yes
I guess this makes sense.

Regards

Rüdiger

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Issac Goldstand <ma...@beamartyr.net>.

Ruediger Pluem wrote:
> 
> On 09/20/2006 08:27 PM, Issac Goldstand wrote:
>> Graham Leggett wrote:
>>
>>> On Wed, September 20, 2006 5:27 pm, Brian Akins wrote:
>>>
>>>
>>>> unless 0 is a valid content-length, which it can be.  Also, what about
>>>> when we are reading something in without a know C-L, for example from an
>>>> origin doing chunks?
>>> I am not sure what the current cache code does to handle chunked entities
>>> without a content length - in theory allowing the code to cache bodies of
>>> no predetermined size leaves the cache open to a potential DoS.
>>>
> 
> You can set a max cache file size (CacheMaxFileSize) which prevents caching files that are larger then
> a specfic size. This is checked after each bucket is written to the disk. If the
> stream is larger then the max file size the file gets deleted and caching of this request
> is stopped. So this also works with chunked responses.

Yes - I didn't even consider this issue.

> 
>>
>> Nothing, IIRC.  Any reason we can't add a C-L filter immediately after
>> CACHE_SAVE to de-chunk and C-L it as needed?
> 
> First of all I guess you mean: BEFORE the CACHE_SAVE filter :-).
> Yes, there is a reason why we cannot do this: This would create a possible DoS, because we have to
> suck in the whole response first before actually forwarding it. Also this would not work with flush
> buckets.

Well, yes.  I stuck de-chunk in there as an afterthought (the original
check just being a sanity check on the reported entity size to take care
of that 0 length case).

Why the DoS, though?  No reason to suck everything in first - my thought
was to update the headers a second time after the body was written.
Only thing we need to hang on to is byte count and status (eg, headers
ended, de-chunking state, etc). Ditto for flush buckets.  What am I not
considering?

  Issac

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Niklas Edmundsson <ni...@acc.umu.se>.
On Thu, 21 Sep 2006, Graham Leggett wrote:

> Hmmm - this affects the case where another process/thread is delivering
> from a still-being-cached entity.
>
> If the lead thread decides to stop, and other threads are following, the
> other following threads will deliver CacheMaxFileSize data, and cut the
> request short.
>
> One workaround for this problem is to have following threads ignore the
> cached entity if the entity does not have a content length - something the
> entity will have when caching is complete.
>
> This means the backend server will still see a spike of traffic while the
> object is being cached, but the cache will no try and cache multiple
> entities until the first one wins, which happens now.

Our patch solves this by pausing read-threads while the object is 
being cached until there is a known length of the body, with a timeout 
to detect if the caching thread has died. Drawback is that you have to 
write the header twice, but that's cheap compared to caching an object 
N times. Seems to do the trick, but it hasn't had nearly as much 
pounding as the cache-from-file case (the main use is on a ftp server, 
remember).
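
The wait-with-timeout idea can be sketched with an in-memory stand-in, assuming the readers and the caching thread live in the same process. The patch itself is not shown in this thread, so the class and method names here are hypothetical, and a threading.Event replaces whatever on-disk signalling the real code uses.

```python
import threading
import time


class InProgressEntry:
    """Hypothetical stand-in for an entry that is still being cached."""

    def __init__(self):
        self._length_known = threading.Event()
        self.length = None

    def finish(self, length):
        """Called by the caching thread once the body length is known."""
        self.length = length
        self._length_known.set()

    def wait_for_length(self, timeout):
        """Return the body length, or None if the caching thread seems dead."""
        if self._length_known.wait(timeout):
            return self.length
        return None


def caching_thread(entry):
    time.sleep(0.05)          # pretend to fetch and store the body
    entry.finish(11)


live = InProgressEntry()
worker = threading.Thread(target=caching_thread, args=(live,))
worker.start()
live_result = live.wait_for_length(timeout=2.0)
worker.join()

dead = InProgressEntry()      # nobody ever calls finish() on this one
dead_result = dead.wait_for_length(timeout=0.05)
```

A reader that gets None back would fall through to the backend instead of waiting forever on a caching thread that has died.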

/Nikke - currently working on getting a machine up for testing
          the smaller patches before submitting them.
-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
  Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se | nikke@acc.umu.se 
---------------------------------------------------------------------------
  The dog ate my .REP packet.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Graham Leggett <mi...@sharp.fm>.
On Wed, September 20, 2006 9:50 pm, Ruediger Pluem wrote:

> You can set a max cache file size (CacheMaxFileSize) which prevents
> caching files that are larger then
> a specfic size. This is checked after each bucket is written to the disk.
> If the
> stream is larger then the max file size the file gets deleted and caching
> of this request
> is stopped. So this also works with chunked responses.

Hmmm - this affects the case where another process/thread is delivering
from a still-being-cached entity.

If the lead thread decides to stop, and other threads are following, the
other following threads will deliver CacheMaxFileSize data, and cut the
request short.

One workaround for this problem is to have following threads ignore the
cached entity if the entity does not have a content length - something the
entity will have when caching is complete.

This means the backend server will still see a spike of traffic while the
object is being cached, but the cache will not try to cache multiple
entities until the first one wins, which happens now.

Regards,
Graham
--



Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Ruediger Pluem <rp...@apache.org>.

On 09/20/2006 08:27 PM, Issac Goldstand wrote:
> 
> Graham Leggett wrote:
> 
>>On Wed, September 20, 2006 5:27 pm, Brian Akins wrote:
>>
>>
>>>unless 0 is a valid content-length, which it can be.  Also, what about
>>>when we are reading something in without a know C-L, for example from an
>>>origin doing chunks?
>>
>>I am not sure what the current cache code does to handle chunked entities
>>without a content length - in theory allowing the code to cache bodies of
>>no predetermined size leaves the cache open to a potential DoS.
>>

You can set a max cache file size (CacheMaxFileSize) which prevents caching files that are larger than
a specific size. This is checked after each bucket is written to the disk. If the
stream is larger than the max file size the file gets deleted and caching of this request
is stopped. So this also works with chunked responses.
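
The per-bucket size check can be sketched as follows, assuming buckets arrive as byte strings of unknown total length. The function name and the max_size parameter are hypothetical stand-ins for the real filter and the CacheMaxFileSize setting.

```python
import os
import tempfile


def store_body(path, buckets, max_size):
    """Write buckets to `path`; abandon and delete the file if it grows too big."""
    written = 0
    with open(path, "wb") as f:
        for bucket in buckets:
            f.write(bucket)
            written += len(bucket)
            if written > max_size:       # checked after every bucket
                break
    if written > max_size:
        os.unlink(path)                  # drop the partial entry
        return None
    return written


cache_dir = tempfile.mkdtemp()
small_path = os.path.join(cache_dir, "small")
big_path = os.path.join(cache_dir, "big")
small = store_body(small_path, [b"abc", b"def"], max_size=10)
big = store_body(big_path, [b"x" * 8, b"y" * 8, b"z" * 8], max_size=10)
```

The sketch only covers the caching side; in the server the response would keep flowing to the client after caching is abandoned.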

> 
> 
> Nothing, IIRC.  Any reason we can't add a C-L filter immediately after
> CACHE_SAVE to de-chunk and C-L it as needed?

First of all I guess you mean: BEFORE the CACHE_SAVE filter :-).
Yes, there is a reason why we cannot do this: This would create a possible DoS, because we have to
suck in the whole response first before actually forwarding it. Also this would not work with flush
buckets.

Regards

Rüdiger


Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Issac Goldstand <ma...@beamartyr.net>.

Graham Leggett wrote:
> On Wed, September 20, 2006 5:27 pm, Brian Akins wrote:
> 
>> unless 0 is a valid content-length, which it can be.  Also, what about
>> when we are reading something in without a know C-L, for example from an
>> origin doing chunks?
> 
> I am not sure what the current cache code does to handle chunked entities
> without a content length - in theory allowing the code to cache bodies of
> no predetermined size leaves the cache open to a potential DoS.
> 

Nothing, IIRC.  Any reason we can't add a C-L filter immediately after
CACHE_SAVE to de-chunk and C-L it as needed?

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Graham Leggett <mi...@sharp.fm>.
On Wed, September 20, 2006 5:27 pm, Brian Akins wrote:

> unless 0 is a valid content-length, which it can be.  Also, what about
> when we are reading something in without a know C-L, for example from an
> origin doing chunks?

I am not sure what the current cache code does to handle chunked entities
without a content length - in theory allowing the code to cache bodies of
no predetermined size leaves the cache open to a potential DoS.

Regards,
Graham
--



Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Issac Goldstand <ma...@beamartyr.net>.

Brian Akins wrote:
> Issac Goldstand wrote:
>> I don't understand why bother getting so complex.  Touch/truncate the
>> body file when storing the header, and then a missing body means things
>> have gone amok - retry the request.  Conversely, a zero-length, or < C-L
>> body length means another thread is working on the body.
>>
> 
> unless 0 is a valid content-length, which it can be. 

In that case, we'll know that from having read the header file.
Frankly, in that case we don't even need to look for a data file, as an
optimization, and can even probably safely delete the empty file.

> Also, what about
> when we are reading something in without a know C-L, for example from an
> origin doing chunks?

We'll know it's chunked, and the possibility of getting a chunked body
of 0 length and not having the initial 0 chunk length immediately
following the headers from the response is pretty slim, IMHO.

For any other length we don't introduce any problem we don't have now.
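
The rules above can be sketched as a small classifier, assuming the body file is created (touched) when the header is stored and that the header file supplies the Content-Length, or nothing for a chunked response still in flight. The function name and the state labels are hypothetical.

```python
import os
import tempfile


def body_state(body_path, content_length):
    """Classify a cached body given the Content-Length from the header file.

    `content_length` is None when the length is not known yet (e.g. chunked).
    """
    if content_length == 0:
        return "complete"                # empty entity: no body file needed
    if not os.path.exists(body_path):
        return "broken"                  # header without body: retry the request
    size = os.path.getsize(body_path)
    if content_length is None or size < content_length:
        return "in-progress"             # another thread is still writing
    return "complete"


cache_dir = tempfile.mkdtemp()
body_path = os.path.join(cache_dir, "entry.body")
missing_state = body_state(body_path, 10)
open(body_path, "wb").close()            # touch/truncate when storing the header
touched_state = body_state(body_path, 10)
with open(body_path, "wb") as f:
    f.write(b"0123456789")
full_state = body_state(body_path, 10)
```

Under these rules an entry with unknown length stays "in-progress" until the header is rewritten with the final length, which is why the second header write discussed elsewhere in the thread matters.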

>>  > You're right, this is a tricky one, but there is a solution out there.
>> Maybe we're attacking the problem from the wrong angle.  Rather than
>> modifying mod_cache, modify the garbage-collector (e.g.,
>> htcacheclean). Do a two pass cleanup. 
> 
> I think it's insane that it has to traverse the directory structure to
> do find the objects.  There should be an index of objects.  Traversing
> the tree can be a huge hit on large, busy structures.
> 

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Brian Akins <br...@turner.com>.
Issac Goldstand wrote:
> I don't understand why bother getting so complex.  Touch/truncate the
> body file when storing the header, and then a missing body means things
> have gone amok - retry the request.  Conversely, a zero-length, or < C-L
> body length means another thread is working on the body.
> 

Unless 0 is a valid content-length, which it can be.  Also, what about
when we are reading something in without a known C-L, for example from an
origin doing chunks?


>  > You're right, this is a tricky one, but there is a solution out there.
> Maybe we're attacking the problem from the wrong angle.  Rather than
> modifying mod_cache, modify the garbage-collector (e.g., htcacheclean). 
> Do a two pass cleanup. 

I think it's insane that it has to traverse the directory structure to
find the objects.  There should be an index of objects.  Traversing
the tree can be a huge hit on large, busy structures.


-- 
Brian Akins
Chief Operations Engineer
Turner Digital Media Technologies

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Issac Goldstand <ma...@beamartyr.net>.

Graham Leggett wrote:
> Niklas Edmundsson wrote:
>
>> However, I don't see how you can do a lockless design with multiple 
>> files and an index that can do:
>>
>> * Clients read from the cache as files are being cached.
>> * Only one session caches the same file.
>> * Header/Body updates.
>> * No index/files out-of-sync issues. Ever.
>
> Thinking about this some more I do see a race during purging - a cache 
> thread could read the header, the purge deletes header and body, and 
> then the cache thread reads the body, and interprets the missing body 
> as "the body is still coming".
>
> One possible (and reasonably simple) solution would be to cache the 
> header and body in a unique directory - the directory name becomes the 
> key, and the entry is either cached completely / still being cached if 
> the directory exists. This assumes it's possible to atomically delete 
> directories.
>
I don't understand why bother getting so complex.  Touch/truncate the 
body file when storing the header, and then a missing body means things 
have gone amok - retry the request.  Conversely, a zero-length, or < C-L 
body length means another thread is working on the body.

> Another option is to version the filename of the body based on a key 
> in the header. In other words, in the header, called <key>.header, is 
> a version number <timestamp>, meaning there should be a body called 
> <key>.<timestamp>.body. A replacement cached entry therefore cannot 
> stomp on what pre existing threads are doing. If the body file is 
> created first, before the header file, then a non existent body file 
> means "this entry has been invalidated, try the request again".
>
> There is an assumption that <timestamp> is fine grained enough to be 
> unique.
>
> You're right, this is a tricky one, but there is a solution out there.
Maybe we're attacking the problem from the wrong angle.  Rather than 
modifying mod_cache, modify the garbage-collector (e.g., htcacheclean).  
Do a two pass cleanup.  The first pass is a data-store traversal pass
which decides what to remove.  It immediately purges the header file, 
and stores the entity key (or filename, or whatever it needs to 
re-access the entity) in a list.  Once the first pass finishes, a second 
pass is made leisurely cleaning up all of the entities that are still 
missing their header files (that way, if a mod_cache thread re-caches 
the entity, we won't purge it).

That should be a safe solution, provided that the time taken to perform 
the first pass is shorter than the time between opening the header and 
body files.  That should normally be the case, unless someone can come 
up with a reasonable case where it wouldn't be so?
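
A minimal sketch of the two-pass idea, assuming a flat cache directory in
which each entity is stored as a hypothetical <key>.header / <key>.body
pair. The real mod_disk_cache layout and the htcacheclean internals differ,
and should_purge is a caller-supplied policy, not an existing hook.

```python
import os

HEADER = ".header"   # hypothetical naming, not the real on-disk layout
BODY = ".body"

def first_pass(root, should_purge):
    """Delete the header of every entity selected for purging; return the keys."""
    purged = []
    for name in sorted(os.listdir(root)):
        if not name.endswith(HEADER):
            continue
        key = name[:-len(HEADER)]
        if should_purge(key):
            os.remove(os.path.join(root, name))
            purged.append(key)
    return purged

def second_pass(root, purged):
    """Delete bodies whose header is still missing; skip re-cached entities."""
    removed = []
    for key in purged:
        if os.path.exists(os.path.join(root, key + HEADER)):
            continue  # a cache thread stored the entity again, leave it alone
        try:
            os.remove(os.path.join(root, key + BODY))
        except FileNotFoundError:
            continue  # body already gone, nothing to do
        removed.append(key)
    return removed
```

The gap between the two calls is what gives a thread that has already read
a header time to open the matching body before it disappears.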

  Issac

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Niklas Edmundsson <ni...@acc.umu.se>.
On Wed, 20 Sep 2006, Brian Akins wrote:

> Niklas Edmundsson wrote:
> don't care about performance...
>> 
>> Actually, cache on xfs mounted with atime doesn't seem to be a performance 
>> killer oddly enough... Our frontends had no problems surviving 1k 
>> requests/s during the latest mozilla-update-barrage.
>
> 1k requests/second is not really that much...  10k requests/second is more 
> what I'm used to.  XFS sucks for us as a cache storage.  It tends to crock 
> under some traffic patterns (reads vs writes).  ext3 is actually more 
> reliable for us.  Reiserfs is interesting, but tends to go haywire from time 
> to time.

I think the key difference here is our average file size... We don't
need that many requests/s to bottom out GigE normally.

> We clean our cache often because we have a really quick way to find the size 
> and remove the oldest expired objects first.  Every cache store gets recorded 
> in SQLite with info about the object (size, mtime, expire time, url, key, 
> etc.).  Makes it trivial to write cron jobs to do cache management.

Yup.

/Nikke
-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
  Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se      |     nikke@acc.umu.se
---------------------------------------------------------------------------
  Don't force it, use a bigger hammer
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Brian Akins <br...@turner.com>.
Niklas Edmundsson wrote:
don't care about performance...
> 
> Actually, cache on xfs mounted with atime doesn't seem to be a 
> performance killer oddly enough... Our frontends had no problems 
> surviving 1k requests/s during the latest mozilla-update-barrage.

1k requests/second is not really that much...  10k requests/second is 
more what I'm used to.  XFS sucks for us as a cache storage.  It tends 
to crock under some traffic patterns (reads vs writes).  ext3 is 
actually more reliable for us.  Reiserfs is interesting, but tends to go 
haywire from time to time.

We clean our cache often because we have a really quick way to find the 
size and remove the oldest expired objects first.  Every cache store 
gets recorded in SQLite with info about the object (size, mtime, expire 
time, url, key, etc.).  Makes it trivial to write cron jobs to do cache
management.
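
A rough sketch of that kind of index, assuming a single table whose columns
mirror the fields listed above. The table name, column names and helper
functions are made up for illustration; this is not the actual code.

```python
import sqlite3

def open_index(path=":memory:"):
    """Open (or create) the index; one row per cached object."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS objects ("
        " key TEXT PRIMARY KEY, url TEXT, size INTEGER,"
        " mtime REAL, expires REAL)"
    )
    return db

def record_store(db, key, url, size, mtime, expires):
    """Record (or refresh) an object when it is written to the cache."""
    db.execute(
        "INSERT OR REPLACE INTO objects VALUES (?, ?, ?, ?, ?)",
        (key, url, size, mtime, expires),
    )
    db.commit()

def cache_size(db):
    """Total size of all indexed objects, without touching the filesystem."""
    return db.execute("SELECT COALESCE(SUM(size), 0) FROM objects").fetchone()[0]

def oldest_expired(db, now, limit=100):
    """Keys of expired objects, oldest expiry first."""
    rows = db.execute(
        "SELECT key FROM objects WHERE expires < ? ORDER BY expires LIMIT ?",
        (now, limit),
    )
    return [row[0] for row in rows]

def forget(db, key):
    """Drop an object from the index when it is removed from the cache."""
    db.execute("DELETE FROM objects WHERE key = ?", (key,))
    db.commit()
```

A cron job would then call cache_size(), and while it is over budget remove
the files for oldest_expired() and forget() each key.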

-- 
Brian Akins
Chief Operations Engineer
Turner Digital Media Technologies

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Niklas Edmundsson <ni...@acc.umu.se>.
On Mon, 18 Sep 2006, Brian Akins wrote:

> Graham Leggett wrote:
>
>> I have not seen inside the htcacheclean code, why is the code reading the
>> headers? In theory the cache should be purged based on last access time,
>> deleted as space is needed.
>
> Everyone should be mounting cache directories noatime, unless they don't care 
> about performance...

Actually, cache on xfs mounted with atime doesn't seem to be a 
performance killer oddly enough... Our frontends had no problems 
surviving 1k requests/s during the latest mozilla-update-barrage. 
Other mirrors had problems, so it seems we ended up with taking the 
majority of the load...

That said: yes, noatime is quicker but if you want to be able to clean 
your cache often (think new linux distro release which quickly fills 
up the cache with new contents) atime+fs traversal is a better 
combined solution than having to open/read every header.
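
A small sketch of what atime+fs traversal amounts to, assuming the cache
filesystem is mounted with atime updates enabled. The function names are
made up, and the only per-file cost is a stat(): no header is opened.

```python
import os

def scan_by_atime(root):
    """Return (atime, size, path) for every file under root, oldest access first."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)  # metadata only, file contents are never read
            except FileNotFoundError:
                continue  # removed by someone else during the walk
            entries.append((st.st_atime, st.st_size, path))
    entries.sort()
    return entries

def pick_victims(entries, bytes_to_free):
    """Choose least recently accessed files until enough space would be freed."""
    victims, freed = [], 0
    for _atime, size, path in entries:
        if freed >= bytes_to_free:
            break
        victims.append(path)
        freed += size
    return victims
```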


/Nikke
-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
  Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se      |     nikke@acc.umu.se
---------------------------------------------------------------------------
  That's not a bug. It's supposed to do that.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Brian Akins <br...@turner.com>.
Graham Leggett wrote:

> I have not seen inside the htcacheclean code, why is the code reading the
> headers? In theory the cache should be purged based on last access time,
> deleted as space is needed.

Everyone should be mounting cache directories noatime, unless they don't 
care about performance...

> Your patch is battle tested, and fixes some specific problems, the only
> issue that I think needs to be resolved is the question of whether single
> file or multiple files are preferable, taking into account performance on
> platforms other than Linux as well.

I'm very interested in this as well.  Very good ideas that just need a 
little refinement.


-- 
Brian Akins
Chief Operations Engineer
Turner Digital Media Technologies

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Graham Leggett <mi...@sharp.fm>.
On Mon, September 18, 2006 9:35 am, Niklas Edmundsson wrote:

> The easiest way to deal with this might be to have a timeout, if the
> body hasn't shown up in $timeout time then something went bad,
> DECLINE, meaning that the cache layer thinks it should cache the file
> and acts accordingly. You actually want this fallback anyway, and it's
> probably enough to deal with the purge-problem. The purge should
> delete the oldest unused entries anyway, so the chance of hitting that
> case shouldn't be too common.

A timeout like this isn't different from a backend that doesn't respond
within a reasonable amount of time - this could be the same timeout.

> And yes, since this scheme only might cause on-disk stray files that
> can be cleaned up by purging I can agree that it'll work. However, I
> strongly believe that the purging should not have to read each header
> file the way that htcacheclean currently does it since it poses such a
> strain on the cache filesystem. A file system traversal should be
> enough.

I have not seen inside the htcacheclean code, why is the code reading the
headers? In theory the cache should be purged based on last access time,
deleted as space is needed.

> Anyhow, I can probably rather easily adapt our patches to do it this
> way if that's what people want. I'm not entirely sure what the gain
> would be though, since it's a tad more housekeeping work and double
> the number of inodes to traverse during a purge...

I would like to investigate Brian's comments that having the body in a
single file is a performance win before a method is chosen.

> But, that is future work. I haven't had any comment of the current
> patch of mine yet (lfs-config) so I'm not entirely sure of whether it
> seems OK and I should proceed with the next patch or what.

Your patch is battle tested, and fixes some specific problems, the only
issue that I think needs to be resolved is the question of whether single
file or multiple files are preferable, taking into account performance on
platforms other than Linux as well.

What would help a lot is to break the patch down into bits that fix
specific issues. The "load the entire file into RAM before delivery" issue
is a definite candidate for solving.

Regards,
Graham
--



Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Niklas Edmundsson <ni...@acc.umu.se>.
On Sun, 17 Sep 2006, Graham Leggett wrote:

> Niklas Edmundsson wrote:
>
>> However, I don't see how you can do a lockless design with multiple files 
>> and an index that can do:
>> 
>> * Clients read from the cache as files are being cached.
>> * Only one session caches the same file.
>> * Header/Body updates.
>> * No index/files out-of-sync issues. Ever.
>
> Thinking about this some more I do see a race during purging - a cache thread 
> could read the header, the purge deletes header and body, and then the cache 
> thread reads the body, and interprets the missing body as "the body is still 
> coming".

The easiest way to deal with this might be to have a timeout, if the 
body hasn't shown up in $timeout time then something went bad, 
DECLINE, meaning that the cache layer thinks it should cache the file 
and acts accordingly. You actually want this fallback anyway, and it's 
probably enough to deal with the purge-problem. The purge should 
delete the oldest unused entries anyway, so the chance of hitting that 
case shouldn't be too common.
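
A minimal sketch of that fallback, assuming the reader simply polls for the
body file. The function name, the poll interval and the injectable
clock/sleep arguments are illustrative only; a False return is where the
provider would DECLINE.

```python
import os
import time

def wait_for_body(path, timeout, poll=0.05, clock=time.monotonic, sleep=time.sleep):
    """Wait up to `timeout` seconds for the body file to appear.

    Returns True once the file exists, or False if the deadline passes first,
    in which case the caller should give up on the cached entry.
    """
    deadline = clock() + timeout
    while True:
        if os.path.exists(path):
            return True
        if clock() >= deadline:
            return False
        sleep(poll)
```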

And yes, since this scheme only might cause on-disk stray files that 
can be cleaned up by purging I can agree that it'll work. However, I 
strongly believe that the purging should not have to read each header 
file the way that htcacheclean currently does it since it poses such a 
strain on the cache filesystem. A file system traversal should be 
enough.

Anyhow, I can probably rather easily adapt our patches to do it this 
way if that's what people want. I'm not entirely sure what the gain 
would be though, since it's a tad more housekeeping work and double 
the number of inodes to traverse during a purge...

But that is future work. I haven't had any comments on my current
patch (lfs-config) yet, so I'm not entirely sure whether it
seems OK and I should proceed with the next patch or what. I'm not
that well versed in all the APIs involved, and stuff that looks right to
me might have a much better Apachier solution, so I don't want to get
carried away creating huge patchsets only to have the first one rejected
because my coding style sucks... However, I can understand if you want
a complete patch that solves the lfs issues, but then you'll have to
tell me since I'm not a mind reader ;)

/Nikke
-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
  Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se      |     nikke@acc.umu.se
---------------------------------------------------------------------------
  ************ <--- tribbles playing follow-the-leader
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Graham Leggett <mi...@sharp.fm>.
Niklas Edmundsson wrote:

> However, I don't see how you can do a lockless design with multiple 
> files and an index that can do:
> 
> * Clients read from the cache as files are being cached.
> * Only one session caches the same file.
> * Header/Body updates.
> * No index/files out-of-sync issues. Ever.

Thinking about this some more I do see a race during purging - a cache 
thread could read the header, the purge deletes header and body, and 
then the cache thread reads the body, and interprets the missing body as 
"the body is still coming".

One possible (and reasonably simple) solution would be to cache the 
header and body in a unique directory - the directory name becomes the 
key, and the entry is either cached completely / still being cached if 
the directory exists. This assumes it's possible to atomically delete 
directories.

Another option is to version the filename of the body based on a key in 
the header. In other words, in the header, called <key>.header, is a 
version number <timestamp>, meaning there should be a body called 
<key>.<timestamp>.body. A replacement cached entry therefore cannot 
stomp on what pre existing threads are doing. If the body file is 
created first, before the header file, then a non existent body file 
means "this entry has been invalidated, try the request again".

There is an assumption that <timestamp> is fine grained enough to be unique.
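
A rough sketch of the versioned-body idea, assuming a flat directory and the
<key>.header / <key>.<timestamp>.body names used above. Publishing the header
via a temporary file and a rename is an extra assumption of this sketch, and
the function names are hypothetical.

```python
import os
import time

def store(root, key, headers, body, version=None):
    """Write the body first, then publish a header that names its version."""
    if version is None:
        version = str(time.time_ns())  # assumed fine grained enough to be unique
    with open(os.path.join(root, f"{key}.{version}.body"), "wb") as f:
        f.write(body)
    header = os.path.join(root, f"{key}.header")
    tmp = header + ".tmp"
    with open(tmp, "w") as f:
        f.write(version + "\n" + headers)
    os.replace(tmp, header)  # readers see either the old or the new header
    return version

def load(root, key):
    """Return (headers, body), or None if the entry is missing or invalidated."""
    try:
        with open(os.path.join(root, f"{key}.header")) as f:
            version, _, headers = f.read().partition("\n")
        with open(os.path.join(root, f"{key}.{version}.body"), "rb") as f:
            return headers, f.read()
    except FileNotFoundError:
        return None  # no such body version: try the request again
```

A replacement entry gets a new body filename, so threads still reading the
old body are not disturbed; old bodies are left for the purge to collect.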

You're right, this is a tricky one, but there is a solution out there.

Regards,
Graham
--


Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Graham Leggett <mi...@sharp.fm>.
Niklas Edmundsson wrote:

> However, I don't see how you can do a lockless design with multiple 
> files and an index that can do:
> 
> * Clients read from the cache as files are being cached.
> * Only one session caches the same file.
> * Header/Body updates.
> * No index/files out-of-sync issues. Ever.

It's easy - simply treat the header file as a "single file". If the 
header file exists, the entry is either cached or being cached. The 
existence or non existence of a body file is meaningless in the context 
of the cache without a header file.

If the body file doesn't exist, it's 99.999% of the time going to be 
because the lead thread that is caching the file hasn't got round to 
creating it yet - simply wait around for the file to exist and continue 
as normal. In other words a body file that does not exist is treated as 
a body file of zero length. This should happen rarely enough in practice 
that the code path to detect and deal with the file-not-found error 
should not cause a performance penalty.

The header file is always deleted first in a purge, the body file is 
deleted afterwards at leisure. When a header file is created, the body 
file is created afterwards ensuring any previous file is reset to length 
zero.

Cache purges would delete the header and then the body. Any bodies
floating around without headers are orphaned files and can be deleted
during a purge anyway.
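
The rules above can be sketched roughly as follows, assuming a flat directory
with hypothetical <key>.header / <key>.body names (not the real
mod_disk_cache layout) and made-up function names.

```python
import os

def _header(root, key):
    return os.path.join(root, key + ".header")

def _body(root, key):
    return os.path.join(root, key + ".body")

def begin_store(root, key, header_bytes):
    """Create the header, then create the body, resetting any old one to length zero."""
    with open(_header(root, key), "wb") as f:
        f.write(header_bytes)
    open(_body(root, key), "wb").close()

def body_length(root, key):
    """None if there is no header (not cached); otherwise the current body size.

    A body that does not exist yet is treated as a body of zero length.
    """
    if not os.path.exists(_header(root, key)):
        return None
    try:
        return os.path.getsize(_body(root, key))
    except FileNotFoundError:
        return 0

def purge(root, key):
    """Delete the header first, the body afterwards."""
    for path in (_header(root, key), _body(root, key)):
        try:
            os.remove(path)
        except FileNotFoundError:
            pass

def orphaned_bodies(root):
    """Body files with no header next to them; safe to delete during a purge."""
    names = set(os.listdir(root))
    return sorted(
        n for n in names
        if n.endswith(".body") and n[:-len(".body")] + ".header" not in names
    )
```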

> The current mod_disk_cache seems to be designed for small files and 
> enough memory to hide the problems by the design.

These don't look like design problems, but rather just run of the mill 
bugs. The cache for example should definitely not try and load the 
cached file into RAM first, this has a simple fix.

Regards,
Graham
--

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Issac Goldstand <ma...@beamartyr.net>.

Niklas Edmundsson wrote:
> On Mon, 18 Sep 2006, Brian Akins wrote:
>
>> Niklas Edmundsson wrote:
>>
>
>>> * Only one session caches the same file.
>>
>> Easy to do if we use deterministic tmp files and not the way we 
>> currently do it.  Then all you have to do is when creating temp files 
>> use O_EXCL.
>
> Or, if we skip the tmp files altogether.
Which would happen to be great on win32 systems, for example, where 
renaming the temp data file fails immediately if another thread happens 
to be serving content from the old one, thus leaving you with a new 
header file with an old data file, which is a useless mess.


Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Niklas Edmundsson <ni...@acc.umu.se>.
On Mon, 18 Sep 2006, Brian Akins wrote:

> Niklas Edmundsson wrote:
>
>> 
>> Extra tracking sounds unnecessary if you can do it in a way that
>> doesn't need it.
>
> It's not "extra", it's just adding some tracking.  When an object gets cached,
> log (sql, db, whatever) that /blah/foo/bar.html is cached as 
> /cache/x/y/something.meta.  Then it's very easy to ask the "store" what is 
> /blah/foo/bar.html cached as?  There may be multiples because of vary.

"Extra" because you already have the needed info to puzzle the things 
together...

>> * Clients read from the cache as files are being cached.
>
> That's the hard one, IMO

But the implementation was rather easy once the "cache to separate 
file and mv to correct location"-stuff was ripped out. Or, as easy as 
building your own bucket-type is.

>> * Only one session caches the same file.
>
> Easy to do if we use deterministic tmp files and not the way we currently do 
> it.  Then all you have to do is when creating temp files use O_EXCL.

Or, if we skip the tmp files altogether.

>> * Header/Body updates.
>
> Easier with separate files like mod_disk_cache does now.

True.

>> * No index/files out-of-sync issues. Ever.
>
> Hard to guarantee, but not impossible.  Always to index when storing file and 
> remove when deleting.  This should use something like providers so it's not 
> in core cache code and can be easily modified.
>
>> With locks, yes it's possible but also a hassle to get right with
>> performance intact.
>
> Not really that hard.  Trust me it has been done...

I'll take your word for that.

>> We, as a ftp mirror operated by a non-profit computer club, have a
>> slightly different usecase with single files larger than machine RAM
>> and a working set of approx 40 times larger than RAM. Some bad design
>> decisions in mod_disk_cache becomes really visible in this
>> environment.
>
> Seems to me you should approach problem differently, like rsyncing the 
> mirrored content.  I don't know your environment, but that was just what I came up
> with off the top of my head.

Try rsyncing a few TB of content onto a few hundred GB of cache disk 
and see how that works out for you :)

Our setup is briefly described here by the way:
http://ftp.acc.umu.se/mirror/ftp-about.html

/Nikke
-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
  Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se      |     nikke@acc.umu.se
---------------------------------------------------------------------------
  A closed mouth gathers no feet.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Issac Goldstand <ma...@beamartyr.net>.
Brian Akins wrote:
> Issac Goldstand wrote:
> 
>>  I can see how other tracking information (like how often the
>> cached entity is accessed, last access time, etc) would be useful,
>>
> 
> Also, those statistics could be updated asynchronously by using a queue
> so that statistics doesn't slow down a busy web server.
> 

Not sure that it'd help.  With multiple processes/threads, it'd still
cause IO and physical disk head movements, which is what we really care
about, I think.

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Brian Akins <br...@turner.com>.
Issac Goldstand wrote:

>  I can see how other tracking information (like how often the
> cached entity is accessed, last access time, etc) would be useful,
>

Also, those statistics could be updated asynchronously by using a queue
so that updating statistics doesn't slow down a busy web server.


-- 
Brian Akins
Chief Operations Engineer
Turner Digital Media Technologies

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Issac Goldstand <ma...@beamartyr.net>.

Brian Akins wrote:
> Issac Goldstand wrote:
>   I can see how other tracking information (like how often the
>> cached entity is accessed, last access time, etc) would be useful,
>> albeit expensive to keep track of, but I don't understand this specific
>> example.
> 
> It's not expensive, as these methods are only called when an object is
> added or deleted, which is relatively few if you get good cache hit ratio.

Not necessarily.  The point would be to track "hot" objects so the
garbage collector prioritizes them lower than a fresher entry which
doesn't get accessed much.  Tracking information like that would need to
be done on every cache hit.  stat()ing the file is enough to get the
initial creation time + last access time (which is why it makes sense,
to me at least, to prefer the filesystem's atime overhead to trying to
out-do that code on our own).

> There are instances when you need to "purge" url's selectively.   Think
> a publishing system that automatically purges cache for updated pages,
> for example.
> 

Right.  But you can still do it with an open_entity() (which takes the
cache key and sets the filehandles and everything else in the
cache_handle) and then remove_entity.  My gut instinct tells me that
rewriting remove_entity to take the key instead of the cache_handle will
have side-effects of tracking whether the filehandles in the
cache_object are open or closed...

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Brian Akins <br...@turner.com>.
Issac Goldstand wrote:
>  I can see how other tracking information (like how often the
> cached entity is accessed, last access time, etc) would be useful,
> albeit expensive to keep track of, but I don't understand this specific
> example.

It's not expensive, as these methods are only called when an object is
added or deleted, which is relatively rare if you get a good cache hit ratio.

There are instances when you need to "purge" url's selectively.   Think 
a publishing system that automatically purges cache for updated pages, 
for example.


-- 
Brian Akins
Chief Operations Engineer
Turner Digital Media Technologies

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Issac Goldstand <ma...@beamartyr.net>.

Brian Akins wrote:
> Niklas Edmundsson wrote:
> 
>>
>> Extra tracking sounds unnecessary if you can do it in a way that
>> doesn't need it.
> 
> It's not "extra", it's just adding some tracking.  When an object gets
> cached log (sql, db, whatever) that /blah/foo/bar.html is cached as
> /cache/x/y/something.meta.  Then it's very easy to ask the "store" what
> is /blah/foo/bar.html cached as?  There may be multiples because of vary.
> 

You can do that now, though it may not be a "public" method.  Most
methods (create_entity, open_entity, etc) currently take the cache key
(generally the URL, but in any case, it's what we'd be querying the store
for).  I can see how other tracking information (like how often the
cached entity is accessed, last access time, etc) would be useful,
albeit expensive to keep track of, but I don't understand this specific
example.


 Issac

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Brian Akins <br...@turner.com>.
Niklas Edmundsson wrote:

> 
> Extra tracking sounds unnecessary if you can do it in a way that
> doesn't need it.

It's not "extra", it's just adding some tracking.  When an object gets
cached, log (sql, db, whatever) that /blah/foo/bar.html is cached as
/cache/x/y/something.meta.  Then it's very easy to ask the "store" what
/blah/foo/bar.html is cached as.  There may be multiples because of Vary.

> * Clients read from the cache as files are being cached.

That's the hard one, IMO

> * Only one session caches the same file.

Easy to do if we use deterministic tmp files and not the way we 
currently do it.  Then all you have to do is when creating temp files 
use O_EXCL.
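
A minimal sketch of the O_EXCL approach, assuming the temp file name is
derived deterministically from the cache key by the caller. The function
name is made up for illustration.

```python
import os

def claim_for_caching(tmp_path):
    """Try to become the one session that caches this object.

    O_CREAT | O_EXCL makes the open fail if the file already exists, so only
    the first caller gets a file descriptor; every other caller gets None.
    """
    try:
        return os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    except FileExistsError:
        return None
```

The winner writes the body through the returned descriptor and closes it
with os.close(); everyone else either serves from the backend or waits.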

> * Header/Body updates.

Easier with separate files like mod_disk_cache does now.

> * No index/files out-of-sync issues. Ever.

Hard to guarantee, but not impossible.  Always add to the index when storing
a file and remove when deleting.  This should use something like providers
so it's not in core cache code and can be easily modified.

> With locks, yes it's possible but also a hassle to get right with
> performance intact.

Not really that hard.  Trust me it has been done...


> We, as a ftp mirror operated by a non-profit computer club, have a
> slightly different usecase with single files larger than machine RAM
> and a working set of approx 40 times larger than RAM. Some bad design
> decisions in mod_disk_cache becomes really visible in this
> environment.

Seems to me you should approach the problem differently, like rsyncing the
mirrored content.  I don't know your environment, but that was just what I
came up with off the top of my head.


-- 
Brian Akins
Chief Operations Engineer
Turner Digital Media Technologies

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Niklas Edmundsson <ni...@acc.umu.se>.
On Fri, 15 Sep 2006, Brian Akins wrote:

> The separate header and body files work wonderfully for performance (filling 
> multiple gig interfaces and/or 30k requests/sec. on rather modest hardware).
> If you have them all in one, it can make the sendfile for the body 
> cumbersome.

If you write to the file using mmap on linux, then sendfile() breaks 
yes. mmap didn't give any major performance benefit for the body copy 
though, so it doesn't matter and we don't use it. This is really a 
Linux bug, since non-overlapping write/sendfile should be OK.

> If you somehow track what "entries" are in the cache, it is very easy to purge
> entries.

Extra tracking sounds unnecessary if you can do it in a way that 
doesn't need it.

> At Apachecon, I'll talk some about our version of mod_cache. 
> Unfortunately, I can't share code :( But I can tell you the separate 
> files way is not a performance or housekeeping issue.

If you have the index I can agree.

However, I don't see how you can do a lockless design with multiple 
files and an index that can do:

* Clients read from the cache as files are being cached.
* Only one session caches the same file.
* Header/Body updates.
* No index/files out-of-sync issues. Ever.

With locks, yes it's possible but also a hassle to get right with 
performance intact.

The current mod_disk_cache seems to be designed for small files and 
enough memory to hide the problems by the design. If you have files 
that fit into the OS cache then it doesn't matter if hundreds of 
sessions are caching the same file, it'll work out eventually without 
reduced performance. This isn't the case when each file (DVD image) is 
bigger than your memory and doesn't fit in the OS file cache. In fact 
you can tell that the author never even considered this due to the way
the body is copied (on 32bit you lose).

We, as a ftp mirror operated by a non-profit computer club, have a 
slightly different usecase with single files larger than machine RAM 
and a working set of approx 40 times larger than RAM. Some bad design 
decisions in mod_disk_cache become really visible in this
environment.

/Nikke
-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
  Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se      |     nikke@acc.umu.se
---------------------------------------------------------------------------
  I wish I had a snappy Trek Message to put here...
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Brian Akins <br...@turner.com>.
Niklas Edmundsson wrote:
> Will it be possible to do away with "one file for headers and one file
> for body" in mod_disk_cache with this scheme?
> 
> The thing is that I've been pounding seriously at mod_disk_cache to
> make it able to sustain rather heavy load on not-so-heavy equipment,
> and part of that effort was to wrap headers and body into one file for
> mainly the following purposes:

The separate header and body files work wonderfully for performance
(filling multiple gig interfaces and/or 30k requests/sec. on rather
modest hardware).  If you have them all in one, it can make the sendfile
for the body cumbersome.

If you somehow track what "entries" are in the cache, it is very easy to
purge entries.

At Apachecon, I'll talk some about our version of mod_cache. 
Unfortunately, I can't share code :( But I can tell you the separate 
files way is not a performance or housekeeping issue.



-- 
Brian Akins
Chief Operations Engineer
Turner Digital Media Technologies

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Graham Leggett <mi...@sharp.fm>.
On Thu, September 14, 2006 2:07 pm, Davi Arnaut wrote:

> The cache is required to send to the client the most up-to-date
> response, it doesn't mean it must cache it.

As I recall once cached, if an entry is stale and is revalidated, the
headers coming back with the 304 Not Modified must replace the headers in
the cache.

> What I meant is _if_ it causes significant slowdowns for a common
> cache hit path _probably_ it is better to just revalidate the whole
> entity.

The point behind the cache is that the cache is cheap, while the backend
is expensive. A cache slowdown usually isn't critical, as the cache is
usually significantly faster than the backend. Trying to save a few cycles
in the cache by hitting the backend unnecessarily gives you little
performance gain.

Regards,
Graham
--



Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Davi Arnaut <da...@haxent.com.br>.
On 14/09/2006, at 08:48, Graham Leggett wrote:

> On Thu, September 14, 2006 1:42 pm, Davi Arnaut wrote:
>
>> This is not a top priority since actually there is no complete
>> support for it in mod_cache (partial responses and such), but it
>> would be nice to have it.
>
> HTTP/1.1 compliance is mandatory for the cache. If it doesn't work
> now, it needs to be fixed.

The cache is required to send to the client the most up-to-date  
response; it doesn't mean it must cache it.

What I meant is _if_ it causes significant slowdowns for a common  
cache hit path _probably_ it is better to just revalidate the whole  
entity.

--
Davi Arnaut


Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Graham Leggett <mi...@sharp.fm>.
On Thu, September 14, 2006 1:42 pm, Davi Arnaut wrote:

> This is not a top priority since actually there is no complete
> support for it in mod_cache (partial responses and such), but it
> would be nice to have it.

HTTP/1.1 compliance is mandatory for the cache. If it doesn't work now, it
needs to be fixed.

Regards,
Graham
--



Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Davi Arnaut <da...@haxent.com.br>.
On 14/09/2006, at 04:39, Graham Leggett wrote:

> Niklas Edmundsson wrote:
>
>> Will it be possible to do away with "one file for headers and one  
>> file for body" in mod_disk_cache with this scheme?
>
> This definitely has lots of advantages - however HTTP/1.1 requires  
> that it be possible to modify the headers on a cached entry  
> independently of the cached body. As long as this is catered for,  
> it should be fine.

This is not a top priority since actually there is no complete  
support for it in mod_cache (partial responses and such), but it  
would be nice to have it.

  We could later easily extend the format to support it.

--
Davi Arnaut



Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Niklas Edmundsson <ni...@acc.umu.se>.
On Thu, 14 Sep 2006, Graham Leggett wrote:

> Niklas Edmundsson wrote:
>
>> Will it be possible to do away with "one file for headers and one file for 
>> body" in mod_disk_cache with this scheme?
>
> This definitely has lots of advantages - however HTTP/1.1 requires that it be 
> possible to modify the headers on a cached entry independently of the cached 
> body. As long as this is catered for, it should be fine.

Our patch allows for this: the body is simply stored at an offset, with 
some logic to detect headers larger than the offset and cope with that 
too (albeit this introduces a risk of bad data being sent to the 
client due to the lockless design, so you really want to avoid this by 
having the offset large enough).

Since seek():ing and writing to an offset doesn't occupy disk space in 
normal unix filesystems there isn't a problem in having the data at a 
rather large offset, but I don't know how non-unix behaves in this 
regard.
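
A rough sketch of the single-file layout described above, not the actual patch: headers sit at the start of the file and the body at a fixed offset. The offset value, the header terminator and the function names are assumptions made for illustration, and oversized headers are simply rejected here rather than handled as the patch does.

```python
import os

# Hypothetical fixed offset; the value used in the patch is not stated.
BODY_OFFSET = 64 * 1024
# Assumed terminator marking the end of the header block.
HEADER_END = b"\r\n\r\n"


def write_entry(path, headers, body, body_offset=BODY_OFFSET):
    """Create a cache file with headers at offset 0 and body at body_offset."""
    block = headers + HEADER_END
    if len(block) > body_offset:
        raise ValueError("headers do not fit in front of the body")
    with open(path, "wb") as f:
        f.write(block)
        # Seeking past the end leaves a hole; on sparse-capable
        # filesystems the skipped range occupies no disk blocks.
        f.seek(body_offset)
        f.write(body)


def replace_headers(path, headers, body_offset=BODY_OFFSET):
    """Overwrite the header block in place, leaving the body untouched."""
    block = headers + HEADER_END
    if len(block) > body_offset:
        raise ValueError("headers do not fit in front of the body")
    with open(path, "r+b") as f:
        f.write(block)


def read_entry(path, body_offset=BODY_OFFSET):
    """Return (headers, body); headers end at the first terminator."""
    with open(path, "rb") as f:
        head = f.read(body_offset)
        headers = head.split(HEADER_END, 1)[0]
        f.seek(body_offset)
        body = f.read()
    return headers, body
```

Because readers stop at the first terminator, leftover bytes from a longer, older header block are ignored after a rewrite.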

/Nikke
-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
  Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se      |     nikke@acc.umu.se
---------------------------------------------------------------------------
  To refuse praise is to seek praise twice.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Graham Leggett <mi...@sharp.fm>.
Niklas Edmundsson wrote:

> Will it be possible to do away with "one file for headers and one file 
> for body" in mod_disk_cache with this scheme?

This definitely has lots of advantages - however HTTP/1.1 requires that 
it be possible to modify the headers on a cached entry independently of 
the cached body. As long as this is catered for, it should be fine.

Regards,
Graham
--

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Graham Leggett <mi...@sharp.fm>.
Niklas Edmundsson wrote:

> The stuff is used in production and seems stable, however I haven't had 
> any response to the first (trivial) patch sent so I don't know if 
> there's any interest in this.

Can you post the patch again? Also, if you attach it to a bugzilla 
entry, it's less likely to get lost.

Regards,
Graham
--

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Niklas Edmundsson <ni...@acc.umu.se>.
On Wed, 13 Sep 2006, Davi Arnaut wrote:

> I'm working on this. You may want to check my proposal at 
> http://verdesmares.com/Apache/proposal.txt

Will it be possible to do away with "one file for headers and one file 
for body" in mod_disk_cache with this scheme?

The thing is that I've been pounding seriously at mod_disk_cache to 
make it able to sustain rather heavy load on not-so-heavy equipment, 
and part of that effort was to wrap headers and body into one file for 
mainly the following purposes:

* Fewer files, fewer open():s (small gain)
* Much easier to purge old entries from the cache (huge gain).
   Simply list all files in cache, sort by atime and remove the oldest.
   The old way by using htcacheclean took ages and had less useful
   removal criteria.
* No synchronisation issues between the header file and body file,
   unlink one and it's gone.
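
The purge step in the second point can be sketched roughly as follows, assuming one file per cache entry and a simple total-size limit. The function name and the size-based criterion are illustrative assumptions, not the production code:

```python
import os


def purge_oldest(cache_root, max_bytes):
    """Remove least recently accessed files until the cache fits max_bytes.

    Hypothetical helper: walks cache_root, sorts files by atime and
    unlinks the oldest first. Returns the list of removed paths.
    """
    entries = []
    total = 0
    for dirpath, _dirnames, filenames in os.walk(cache_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue              # entry vanished while we were listing
            entries.append((st.st_atime, st.st_size, path))
            total += st.st_size

    entries.sort()                    # oldest atime first
    removed = []
    for _atime, size, path in entries:
        if total <= max_bytes:
            break
        try:
            os.unlink(path)           # one unlink removes headers and body
        except OSError:
            continue                  # could not remove it; do not count it
        total -= size
        removed.append(path)
    return removed
```

This relies on the filesystem actually updating atime, so mounts with noatime would need a different criterion.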

That's only one of many changes made, but I found it to be crucial to 
be able to have an architecture that's consistent without relying on 
locks. This made it rather easy to implement stuff like serving files 
that are currently being cached from cache, reusing expired cached 
files if the originating file is found to be unmodified, and so on.

But the largest gain is still the cache cleaning process.

The stuff is used in production and seems stable, however I haven't 
had any response to the first (trivial) patch sent so I don't know if 
there's any interest in this.

/Nikke
-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
  Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se      |     nikke@acc.umu.se
---------------------------------------------------------------------------
  Does the Little Mermaid wear an algebra?
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Re: mod_cache responsibilities vs mod_xxx_cache provider responsibilities

Posted by Davi Arnaut <da...@haxent.com.br>.
On 13/09/2006, at 16:29, Issac Goldstand wrote:

> Hi all,
>   I've been hacking at mod_cache a bit, and was surprised to find that
> part of the decision to serve previously cached content or not was being
> made by the backend provider and not mod_cache; specifically, the
> expiration date of the content seems to be checked by mod_disk_cache (as
> part of open_entity), and if the provider check fails, mod_cache doesn't
> even know about the entity (and therefore, in the case of a caching
> proxy,  can't treat it as a possibly stale entity upon which it can just
> do a conditional GET and possibly get a 304, rather than requiring
> mod_proxy to rerequest the entire entity again).
>
> When I originally started looking at the family of cache modules, I
> assumed that all of the decision-making logic would be in mod_cache,
> while the mod_xxx_cache providers would be "dumb" file-stores (at least,
> as far as mod_cache is concerned).  Is this not the case?

I'm working on this. You may want to check my proposal at
http://verdesmares.com/Apache/proposal.txt

>
> If it is, would patches be acceptable if I have the time to try to
> rectify the situation (at least somewhat)?

http://verdesmares.com/Apache/patches/022.patch

I'm still working on it, things may change radically.

--
Davi Arnaut