Posted to dev@httpd.apache.org by Colm MacCarthaigh <co...@stdlib.net> on 2005/08/12 17:46:29 UTC

[PATCH] Make caching hash more deterministic

Currently;

	GET / HTTP/1.1
	Host: ftp.heanet.ie

	GET http://ftp.heanet.ie/ HTTP/1.0

	GET HTTP://Ftp.Heanet.Ie/ HTTP/1.0

are all mapped to different hashes by mod_cache, despite being the same
content. This is an inefficient waste of disk space, and really awkward
for me trying to write a debug/admin tool.

The attached patch makes it deterministic, by mapping them all to;

	"http://ftp.heanet.ie:80/?" 

Instead of "ftp.heanet.ie/?". For a cached webserver, this really
won't make much of a difference since the Host header is forcibly
lower-cased anyway, but for a proxy it definitely helps.  Looking
through my logs I'm seeing lots of simple domain case variations - no
point storing them twice and handling all of the expires multiple times.
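As a rough illustration of the mapping described above (not the patch itself, which is C inside mod_cache), the following Python sketch builds a key of the form scheme://host:port/path?args. The function name, the parameter names and the default-port table are assumptions made for this example only.

```python
# Hypothetical sketch of the canonical key format discussed in this thread.
# DEFAULT_PORTS and cache_key() are illustrative names, not httpd APIs.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def cache_key(scheme, hostname, port, path, args=None):
    """Return a key shaped like "http://ftp.heanet.ie:80/?"."""
    scheme = scheme.lower()        # "HTTP" and "http" become one entry
    hostname = hostname.lower()    # so do "Ftp.Heanet.Ie" and "ftp.heanet.ie"
    if port is None:
        port = DEFAULT_PORTS[scheme]
    # The "?" is always appended, even when there is no query string.
    return f"{scheme}://{hostname}:{port}{path}?{args or ''}"
```

With this sketch, the case variants from the example requests all collapse to the single key "http://ftp.heanet.ie:80/?".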

It also solves the collision that happens if an administrator wants to
run Apache listening on multiple ports, but has mod_cache enabled.
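To make the multi-port collision concrete, here is a small Python model of the two key styles. It assumes the old key is simply hostname + path + "?" + args, as the "ftp.heanet.ie/?" example suggests; both function names are made up for the illustration.

```python
# Hypothetical model of the old and new key styles, for illustration only.
def old_style_key(hostname, path, args=None):
    # No scheme and no port, so the listening port is invisible to the cache.
    return f"{hostname}{path}?{args or ''}"


def new_style_key(scheme, hostname, port, path, args=None):
    # Scheme and port are part of the key, so each listener gets its own entry.
    return f"{scheme.lower()}://{hostname.lower()}:{port}{path}?{args or ''}"


# The same host served on port 80 and on port 8080:
collide = old_style_key("ftp.heanet.ie", "/") == old_style_key("ftp.heanet.ie", "/")
distinct = (new_style_key("http", "ftp.heanet.ie", 80, "/")
            != new_style_key("http", "ftp.heanet.ie", 8080, "/"))
```

Under these assumptions `collide` and `distinct` are both true: the old keys cannot tell the two ports apart, while the new ones can.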

The only awkwardness I can see with this approach, is that;

	GET / HTTP/1.0

would look like this;

	"http://:80/?"

So, I've re-used the _default_ "convention" (underscores are not
permitted in DNS anyway) for such keys;

	"http://_default_:80/?"

Which should at least make a familiar sort of sense to an administrator.
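The fallback can be sketched in a few lines of Python. This is only a model of the behaviour described here; the helper names are hypothetical and the real logic lives in the C patch.

```python
# Hypothetical sketch of the "_default_" fallback for requests without a host.
def key_hostname(hostname):
    """Return the lower-cased hostname, or "_default_" if none was supplied."""
    if not hostname:
        # Underscores are not valid in DNS hostnames, so this placeholder
        # cannot clash with a real host.
        return "_default_"
    return hostname.lower()


def cache_key(scheme, hostname, port, path, args=None):
    return f"{scheme.lower()}://{key_hostname(hostname)}:{port}{path}?{args or ''}"
```

A bare "GET / HTTP/1.0" with no Host header would then produce "http://_default_:80/?" rather than "http://:80/?".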

-- 
Colm MacCárthaigh                        Public Key: colm+pgp@stdlib.net

Re: [PATCH] Make caching hash more deterministic

Posted by Jim Jagielski <ji...@jaguNET.com>.
On Aug 15, 2005, at 4:10 AM, Colm MacCarthaigh wrote:

> On Sat, Aug 13, 2005 at 10:29:54AM +0200, Graham Leggett wrote:
>
>> The idea of canonicalising the name is sound, but munging them into an
>> added :80 and an added ? is really ugly - these are not the kind of URLs
>> that an end user would understand at a glance if they had to see them
>> listed.
>>
>
> An end-user should never see these keys, the only place they are visible
> to any user is the semi-binary mod_disk_cache files. An administrator
> would have to really know what they're doing to find them, or be using
> htcacheadmin - once I finish that, and if it gets accepted.
>
>
>> Is it possible to remove the :80 if the scheme is http, and remove the
>> :443 if the scheme is https? What is the significance of the added "?"?
>>
>
> The "?" isn't me, that's current mod_cache behaviour, so I left it
> alone.
>
> It doesn't have any significance except for avoiding an extra condition.
> r->args is part of the key as well, it just happens to have been NULL in
> those examples.
>
> Either way, doing as you suggest is trivial, but is there really a point
> adding more conditions? Any tool which does inspect the cache files can
> clean it up for presentation to the administrator.
>

I think Colm has a valid point... since these are internal
"representations" then the cleaning up would best be done by the actual
view process. I would imagine that keeping the internal representations
consistent would streamline the actual functional aspects.
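A viewing tool could do that cleanup along these lines. The sketch below is Python using the standard urllib.parse module; the display_url() name and the default-port table are assumptions for the example, not part of any existing admin tool.

```python
# Hypothetical presentation helper: turn an internal cache key into a URL
# that reads naturally, leaving the stored key itself untouched.
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def display_url(key):
    parts = urlsplit(key)
    host = parts.hostname or ""
    # Show the port only when it is not the default for the scheme.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(parts.scheme):
        host = f"{host}:{parts.port}"
    url = f"{parts.scheme}://{host}{parts.path}"
    # Drop the always-present "?" when there is no query string.
    if parts.query:
        url += "?" + parts.query
    return url
```

For example, display_url("http://ftp.heanet.ie:80/?") gives "http://ftp.heanet.ie/", while a key on port 8080 keeps its port.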

Re: [PATCH] Make caching hash more deterministic

Posted by Graham Leggett <mi...@sharp.fm>.
Colm MacCarthaigh wrote:

> An end-user should never see these keys, the only place they are visible
> to any user is the semi-binary mod_disk_cache files. An administrator
> would have to really know what they're doing to find them, or be using
> htcacheadmin - once I finish that, and if it gets accepted. 

Ok cool - I was not sure whether the URL list would be mined for data 
for any reason.

> Either way, doing as you suggest is trivial, but is there really a point
> adding more conditions? Any tool which does inspect the cache files can
> clean it up for presentation to the administrator.

In that case you're right - the less work done on the URL, the faster it 
will be.

Regards,
Graham
--

Re: [PATCH] Make caching hash more deterministic

Posted by Colm MacCarthaigh <co...@stdlib.net>.
On Sat, Aug 13, 2005 at 10:29:54AM +0200, Graham Leggett wrote:
> The idea of canonicalising the name is sound, but munging them into an 
> added :80 and an added ? is really ugly - these are not the kind of URLs 
> that an end user would understand at a glance if they had to see them 
> listed.

An end-user should never see these keys, the only place they are visible
to any user is the semi-binary mod_disk_cache files. An administrator
would have to really know what they're doing to find them, or be using
htcacheadmin - once I finish that, and if it gets accepted. 

> Is it possible to remove the :80 if the scheme is http, and remove the 
> :443 if the scheme is https? What is the significance of the added "?"?

The "?" isn't me, that's current mod_cache behaviour, so I left it
alone.

It doesn't have any significance except for avoiding an extra condition.
r->args is part of the key as well, it just happens to have been NULL in
those examples.

Either way, doing as you suggest is trivial, but is there really a point
adding more conditions? Any tool which does inspect the cache files can
clean it up for presentation to the administrator.

-- 
Colm MacCárthaigh                        Public Key: colm+pgp@stdlib.net

Re: [PATCH] Make caching hash more deterministic

Posted by Graham Leggett <mi...@sharp.fm>.
Colm MacCarthaigh wrote:

> Currently;
> 
> 	GET / HTTP/1.1
> 	Host: ftp.heanet.ie
> 
> 	GET http://ftp.heanet.ie/ HTTP/1.0
> 
> 	GET HTTP://Ftp.Heanet.Ie/ HTTP/1.0
> 
> are all mapped to different hashes by mod_cache; despite being the same
> content, this is an inefficient waste of disk space and really awkward
> for me trying to write a debug/admin tool.
> 
> The attached patch makes it deterministic, by mapping them all to;
> 
> 	"http://ftp.heanet.ie:80/?" 

The idea of canonicalising the name is sound, but munging them into an 
added :80 and an added ? is really ugly - these are not the kind of URLs 
that an end user would understand at a glance if they had to see them 
listed.

Is it possible to remove the :80 if the scheme is http, and remove the 
:443 if the scheme is https? What is the significance of the added "?"?

Regards,
Graham
--



Re: [PATCH] Make caching hash more deterministic

Posted by Colm MacCarthaigh <co...@stdlib.net>.
On Fri, Aug 12, 2005 at 01:34:50PM -0400, Jim Jagielski wrote:
> >Here's a more involved patch that gets the logic right. It's 6pm on a
> >Friday for me, so I have only tested it a little, but thought I'd share
> >for comment before the weekend.
> >
> 
> +1 on inspection... testing to be done over
> the weekend :)

Of course :) I've run http local and proxy cases, and ftp proxy cases,
as well as a few odd things now. With UseCanonicalName on, it does
improve the hitrates. 

I've changed the patch a little (attached), but only some cosmetic
comment changes, and I ditched the "local://" URI.

Thinking about it, for a cache to be shared amongst protocols, things
like the connection port would have to be faked anyway. So we might as
well include the real serving protocol - it makes much more sense to
administrators.

-- 
Colm MacCárthaigh                        Public Key: colm+pgp@stdlib.net

Re: [PATCH] Make caching hash more deterministic

Posted by Jim Jagielski <ji...@jaguNET.com>.
On Aug 12, 2005, at 1:10 PM, Colm MacCarthaigh wrote:

> On Fri, Aug 12, 2005 at 04:59:20PM +0100, Colm MacCarthaigh wrote:
>
>> On Fri, Aug 12, 2005 at 11:54:44AM -0400, Brian Akins wrote:
>>
>>> Should this honor UseCanonicalName?  If so, could just use
>>> ap_get_servername(r) instead of r->hostname.  This may further compact
>>> the number of entries.
>>>
>>
>> Yes, but I think there'd have to be additional code to detect the proxy
>> cases. And you pointing that out has just reminded me of a bug in my
>> patch - it doesn't work for;
>>
>>     GET ftp://ftp.heanet.ie/pub/heanet/100.txt HTTP/1.0
>>
>> I'll go make that work too.
>>
>
> Here's a more involved patch that gets the logic right. It's 6pm on a
> Friday for me, so I have only tested it a little, but thought I'd share
> for comment before the weekend.
>

+1 on inspection... testing to be done over
the weekend :)

Re: [PATCH] Make caching hash more deterministic

Posted by Colm MacCarthaigh <co...@stdlib.net>.
On Fri, Aug 12, 2005 at 04:59:20PM +0100, Colm MacCarthaigh wrote:
> On Fri, Aug 12, 2005 at 11:54:44AM -0400, Brian Akins wrote:
> > Should this honor UseCanonicalName?  If so, could just use 
> > ap_get_servername(r) instead of r->hostname.  This may further compact 
> > the number of entries.
> 
> Yes, but I think there'd have to be additional code to detect the proxy
> cases. And you pointing that out has just reminded me of a bug in my
> patch - it doesn't work for;
> 
> 	GET ftp://ftp.heanet.ie/pub/heanet/100.txt HTTP/1.0
> 
> I'll go make that work too.

Here's a more involved patch that gets the logic right. It's 6pm on a
Friday for me, so I have only tested it a little, but thought I'd share
for comment before the weekend.

-- 
Colm MacCárthaigh                        Public Key: colm+pgp@stdlib.net

Re: [PATCH] Make caching hash more deterministic

Posted by Colm MacCarthaigh <co...@stdlib.net>.
On Fri, Aug 12, 2005 at 11:54:44AM -0400, Brian Akins wrote:
> Should this honor UseCanonicalName?  If so, could just use 
> ap_get_servername(r) instead of r->hostname.  This may further compact 
> the number of entries.

Yes, but I think there'd have to be additional code to detect the proxy
cases. And you pointing that out has just reminded me of a bug in my
patch - it doesn't work for;

	GET ftp://ftp.heanet.ie/pub/heanet/100.txt HTTP/1.0

I'll go make that work too.
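One way to picture the proxy case is the Python sketch below, which derives the key from the request target. When the target is an absolute URI (as a proxy client sends), the scheme, host and port come from the URI; otherwise they come from the Host header and the local listener. The function name, its parameters and the use of port 21 as the ftp default are assumptions for the illustration, and the Host header is assumed to carry no port. The reworked patch may well do this differently.

```python
# Hypothetical sketch: derive a cache key from either an origin-form target
# ("/") or an absolute-URI target ("ftp://host/path"), as seen by a proxy.
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def key_for_request(target, host_header=None, local_scheme="http", local_port=80):
    parts = urlsplit(target)
    if parts.scheme and parts.netloc:
        # Proxy-style request: everything comes from the absolute URI.
        scheme = parts.scheme          # urlsplit() already lower-cases this
        hostname = parts.hostname      # and this
        port = parts.port if parts.port is not None else DEFAULT_PORTS[scheme]
        path = parts.path or "/"
    else:
        # Local request: fall back to the Host header and the listener.
        scheme = local_scheme
        hostname = (host_header or "_default_").lower()
        port = local_port
        path = parts.path
    return f"{scheme}://{hostname}:{port}{path}?{parts.query}"
```

In this model the three HTTP requests from the start of the thread share one key, and the ftp request gets a key of its own with the ftp scheme and port.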

-- 
Colm MacCárthaigh                        Public Key: colm+pgp@stdlib.net

Re: [PATCH] Make caching hash more deterministic

Posted by Brian Akins <br...@turner.com>.
Should this honor UseCanonicalName?  If so, could just use 
ap_get_servername(r) instead of r->hostname.  This may further compact 
the number of entries.
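The effect of that choice can be modelled roughly as follows. This Python sketch stands in for the C logic: server_name plays the role of what ap_get_servername(r) would return with UseCanonicalName on, client_hostname plays the role of r->hostname, and the is_proxy_request flag is a hypothetical input reflecting the point made elsewhere in the thread that proxy requests need separate handling.

```python
# Hypothetical model of choosing the hostname that goes into the cache key.
def key_hostname(client_hostname, server_name, use_canonical_name, is_proxy_request):
    # For proxied requests the configured server name is the proxy's own,
    # not the origin's, so the client-supplied host must be kept.
    if is_proxy_request or not use_canonical_name:
        return client_hostname.lower()
    # Otherwise every alias of the virtual host maps to one canonical name.
    return server_name.lower()


aliases = ["heanet.ie", "www.heanet.ie", "WWW.HEANET.IE"]
with_canonical = {key_hostname(a, "www.heanet.ie", True, False) for a in aliases}
without_canonical = {key_hostname(a, "www.heanet.ie", False, False) for a in aliases}
```

Here `with_canonical` contains a single hostname, while `without_canonical` still contains two, which is the compaction being suggested.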



-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies