Posted to dev@httpd.apache.org by Dean Gaudet <dg...@arctic.org> on 1997/07/22 10:46:22 UTC

mod_include heuristic

Ok, here is something that I think I would be happy with:  use boyer-moore
and mmap() in mod_include to speed it up.  Then use a quick and dirty
two-pass heuristic to calculate Last-Modified (and ETag) provided that all
of the directives are truly static.  The first pass aborts as soon as it
encounters something non-static. 
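
As a rough illustration of the scanning half of this idea, here is a minimal
sketch in Python rather than the C that mod_include is written in. It uses the
Horspool simplification of Boyer-Moore (bad-character table only), not the
full algorithm, and searches a read-only memory map for the "<!--#" marker
that starts an SSI directive. The function names are hypothetical and not
taken from any Apache source.

```python
import mmap


def horspool_find_all(haystack, needle):
    """Yield each offset where needle occurs in haystack (bytes or mmap)."""
    m, n = len(needle), len(haystack)
    if m == 0 or n < m:
        return
    # Bad-character table: distance to shift when the last byte of the
    # current window is b. Bytes absent from the table shift a full m.
    shift = {b: m - 1 - i for i, b in enumerate(needle[:-1])}
    pos = 0
    while pos <= n - m:
        if haystack[pos:pos + m] == needle:
            yield pos
        pos += shift.get(haystack[pos + m - 1], m)


def find_directives(path, marker=b"<!--#"):
    """Map the file read-only and return the offsets of every SSI marker."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return list(horspool_find_all(mm, marker))
```

Most bytes of a typical HTML file are never compared against the marker,
because a window whose last byte is not in the marker is skipped entirely.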

The end result will probably be about the same performance as the existing
mod_include.  With the added benefit of Last-Modifieds for caches to chew
on.  (Plus even Content-Lengths.)  This eliminates much of the need for
XBitHack, certainly enough that we don't have to consider other extensions
to it. 

The directive "IncludesTwoPassThresh NNN" would indicate that two-pass
should be aborted whenever NNN bytes have been read from the inputs...
which lets it be disabled, and prevents it from being a problem on large
inputs. 
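
A small sketch of how such a byte budget might behave, in Python. The
interpretation that a threshold of 0 disables the two-pass mode entirely is
an assumption read into "lets it be disabled", and the function name is
hypothetical.

```python
def first_pass_within_budget(input_sizes, threshold):
    """Return True if the first pass may finish, False if it must abort.

    input_sizes: sizes in bytes of the inputs, in the order they are read.
    threshold:   the NNN value; 0 or less is assumed to mean "disabled".
    """
    if threshold <= 0:
        return False
    total = 0
    for size in input_sizes:
        total += size
        if total >= threshold:
            # NNN bytes have been read: give up on the two-pass mode.
            return False
    return True
```

For example, first_pass_within_budget([100, 200], 1000) allows the pass,
while first_pass_within_budget([600, 500], 1000) aborts it.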

Oh yeah, another implementation detail, it has to generate clever ETags. 
I suggest "inode-mtime" pairs for each file, in the order of inclusion. 
Abort the two pass if the length of the ETag goes over 128 bytes. 
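
One way this could look, sketched in Python. The hex formatting, the ";"
separator between pairs, and counting the surrounding quotes toward the
128-byte limit are all assumptions; only the "inode-mtime" pairing, the
inclusion order, and the cutoff come from the proposal.

```python
import os

MAX_ETAG_LEN = 128  # the cutoff suggested in the text


def etag_from_pairs(pairs, max_len=MAX_ETAG_LEN):
    """Build a quoted ETag from (inode, mtime) pairs, or None to abort."""
    body = ";".join("%x-%x" % (ino, mtime) for ino, mtime in pairs)
    etag = '"' + body + '"'
    return etag if len(etag) <= max_len else None


def etag_for_files(paths, max_len=MAX_ETAG_LEN):
    """Stat each file in inclusion order and build the combined ETag."""
    pairs = []
    for path in paths:
        st = os.stat(path)
        pairs.append((st.st_ino, int(st.st_mtime)))
    return etag_from_pairs(pairs, max_len)
```

Because the pairs are joined in inclusion order, reordering the includes
changes the ETag even when no file has been touched.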

What think you?

Dean


Re: mod_include heuristic

Posted by Ben Laurie <be...@algroup.co.uk>.
Dean Gaudet wrote:
> 
> Oh yeah, that'd be cool.  Just MD5 the etag as I'm collecting them... far
> cheaper than md5ing the entire output.

Or even - switch to MD5 when you go over 128...
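
A sketch of that fallback in Python: keep the readable pair list while it
fits, and hash it only when it would exceed the limit. The pair formatting
and separator are assumptions, and the function name is hypothetical.

```python
import hashlib


def hybrid_etag(pairs, max_len=128):
    """Return the literal pair list as an ETag, or its MD5 if too long."""
    literal = ";".join("%x-%x" % (ino, mtime) for ino, mtime in pairs)
    if len(literal) + 2 <= max_len:  # +2 for the surrounding quotes
        return '"%s"' % literal
    digest = hashlib.md5(literal.encode("ascii")).hexdigest()
    return '"%s"' % digest
```

The two forms cannot be confused with each other, since the literal form
always contains a "-" and a hex digest never does.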

Cheers,

Ben.

-- 
Ben Laurie                Phone: +44 (181) 994 6435  Email: ben@algroup.co.uk
Freelance Consultant and  Fax:   +44 (181) 994 6472
Technical Director        URL: http://www.algroup.co.uk/Apache-SSL
A.L. Digital Ltd,         Apache Group member (http://www.apache.org)
London, England.          Apache-SSL author

Re: mod_include heuristic

Posted by Dean Gaudet <dg...@arctic.org>.
Oh yeah, that'd be cool.  Just MD5 the etag as I'm collecting them... far
cheaper than md5ing the entire output. 
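
A minimal sketch of hashing the pairs as they are collected, in Python. The
class name and the byte format fed to the hash are assumptions; the point is
only that the digest covers a few bytes per included file instead of the
whole response body.

```python
import hashlib


class EtagHasher:
    """Accumulate (inode, mtime) pairs into a fixed-size MD5-based ETag."""

    def __init__(self):
        self._md5 = hashlib.md5()

    def add(self, inode, mtime):
        # Feed one pair per included file, in inclusion order.
        self._md5.update(b"%x-%x;" % (inode, mtime))

    def etag(self):
        # 32 hex digits plus quotes, however many files were included.
        return '"%s"' % self._md5.hexdigest()
```

Usage would be one add() call per file during the first pass, followed by a
single etag() call once the pass completes without aborting.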

Dean

On Tue, 22 Jul 1997, Ben Laurie wrote:

> Dean Gaudet wrote:
> > Oh yeah, another implementation detail, it has to generate clever ETags.
> > I suggest "inode-mtime" pairs for each file, in the order of inclusion.
> > Abort the two pass if the length of the ETag goes over 128 bytes.
> > 
> > What think you?
> 
> How about an MD5 (or other hash) of the above - then there's no limit on
> size.
> 
> Cheers,
> 
> Ben.
> 


Re: mod_include heuristic

Posted by Ben Laurie <be...@algroup.co.uk>.
Dean Gaudet wrote:
> Oh yeah, another implementation detail, it has to generate clever ETags.
> I suggest "inode-mtime" pairs for each file, in the order of inclusion.
> Abort the two pass if the length of the ETag goes over 128 bytes.
> 
> What think you?

How about an MD5 (or other hash) of the above - then there's no limit on
size.

Cheers,

Ben.


Re: mod_include heuristic

Posted by Dean Gaudet <dg...@arctic.org>.
On Tue, 22 Jul 1997, Rob Hartill wrote:

> What's to say that /foo.html isn't being updated by an external program -
> a setup I make use of.

Then /foo.html's time stamp is going to change.

"truly static" means:
- no echos
- include virtual, if I can determine that it's a recursive mod_include
  invocation, probably not
- include file -- note it will use the timestamp on all files
- no conditionals
- no variable expansion
- no exec
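
The list above can be sketched as a first-pass check, here in Python. This is
a deliberately conservative reading: "include virtual" is always treated as
non-static, and any directive not named in the list (config, fsize, and so
on) also aborts. Both choices are assumptions, as are the regular expressions
and the function name.

```python
import re

DIRECTIVE_RE = re.compile(rb"<!--#(\w+)(.*?)-->", re.DOTALL)
FILE_ARG_RE = re.compile(rb'\bfile\s*=\s*"([^"]*)"')
VIRTUAL_ARG_RE = re.compile(rb"\bvirtual\s*=")


def first_pass(doc):
    """Return the included file names if doc is truly static, else None."""
    files = []
    for match in DIRECTIVE_RE.finditer(doc):
        name, args = match.group(1).lower(), match.group(2)
        if name != b"include":
            return None  # echo, exec, if/elif/else/endif, set, ...
        if VIRTUAL_ARG_RE.search(args):
            return None  # include virtual: assumed non-static here
        file_arg = FILE_ARG_RE.search(args)
        if file_arg is None:
            return None  # malformed include
        if b"$" in file_arg.group(1):
            return None  # variable expansion
        files.append(file_arg.group(1).decode("latin-1"))
    return files
```

The returned list is what a second step would stat to pick up the timestamp
of every included file.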

It's a really cheap heuristic.  Aimed at public sites that make
mod_include run all html files. 

> mod_expires does work well with caches.

Yes, and I can count the number of North American sites with admins that
are clued or conscientious enough to turn it on on the fingers of one
hand.  Ok maybe two hands.  I'm aiming for a solution that can be enabled
by default. 

Even hotwired doesn't use expires because I found it too complicated with
the way their prod process works to put something reasonable together.  I
also couldn't justify the time spent to integrate it into the production
process because the powers that be didn't understand the need for it. 

> > (Plus even Content-Lengths.)  This eliminates much of the need for
> > XBitHack, certainly enough that we don't have to consider other extensions
> > to it. 
> 
> There are practical ways to work around XBitHack and be cache-friendly.

Yes, see my last comment.  I want something that is enabled by default... 

Dean



Re: mod_include heuristic

Posted by Rob Hartill <ro...@imdb.com>.
On Tue, 22 Jul 1997, Dean Gaudet wrote:

> Ok, here is something that I think I would be happy with:  use boyer-moore
> and mmap() in mod_include to speed it up.  Then use a quick and dirty
> two-pass heuristic to calculate Last-Modified (and ETag) provided that all
> of the directives are truly static.

What do you mean by "directives are truly static" here?

Do you mean <!--#include virtual="/foo.html" --> is considered "static"?

What's to say that /foo.html isn't being updated by an external program -
a setup I make use of.

> The first pass aborts as soon as it
> encounters something non-static. 
> 
> The end result will probably be about the same performance as the existing
> mod_include.  With the added benefit of Last-Modifieds for caches to chew
> on.

mod_expires does work well with caches.

> (Plus even Content-Lengths.)  This eliminates much of the need for
> XBitHack, certainly enough that we don't have to consider other extensions
> to it. 

There are practical ways to work around XBitHack and be cache-friendly.

> The directive "IncludesTwoPassThresh NNN" would indicate that two-pass
> should be aborted whenever NNN bytes have been read from the inputs...
> which lets it be disabled, and prevents it from being a problem on large
> inputs. 
> 
> Oh yeah, another implementation detail, it has to generate clever ETags. 
> I suggest "inode-mtime" pairs for each file, in the order of inclusion. 
> Abort the two pass if the length of the ETag goes over 128 bytes. 
> 
> What think you?

I like the bit about disabling it  :-)
 

--
Rob Hartill                              Internet Movie Database (Ltd)
http://www.moviedatabase.com/   .. a site for sore eyes.


RE: mod_include heuristic

Posted by Dean Gaudet <dg...@arctic.org>.
I had another idea on how to do this in a more general way.

Attach a "fake" client to r->connection->client, essentially attach a
big buffer.  Now it's easy to snarf the output of subrequests.  But you
still need to find the headers from subrequests before they're tossed
away.  An API phase called pre_handler that runs during run_sub_req()
would be the best.  This way you're not doing extra work on requests that
are being tossed away, you only do the work on requests that are actually
going to be run. 

Now, ok, so you've got your hooks into the subrequests, and you can steal
output.  What do you do with headers?

- the subrequest must have a Last-Modified, if it doesn't then abort the
    entire first pass, otherwise take the max of last-modified and whatever
    you've seen so far
- if the subrequest does not have an ETag then abort the entire first pass,
    otherwise append the ETag to the current ETag.
- if the subrequest has an Expires, take the min of any current expires and
    the subrequest Expires

Or something like that.
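
The three rules above could be sketched like this in Python. Subrequest
headers are modelled as plain dicts, which is an assumption for illustration
and not the Apache API; the ";" used to append ETags and the handling of
strong, quoted ETags only are also assumptions.

```python
from email.utils import parsedate_to_datetime


def merge_subrequest_headers(subrequests):
    """Combine per-subrequest headers, or return None to abort the pass.

    Each item is a dict of header name -> value, with HTTP-date strings
    for Last-Modified and Expires.
    """
    last_modified = None
    expires = None
    etag_parts = []
    for headers in subrequests:
        if "Last-Modified" not in headers or "ETag" not in headers:
            return None  # abort the entire first pass
        lm = parsedate_to_datetime(headers["Last-Modified"])
        last_modified = lm if last_modified is None else max(last_modified, lm)
        etag_parts.append(headers["ETag"].strip('"'))
        if "Expires" in headers:
            ex = parsedate_to_datetime(headers["Expires"])
            expires = ex if expires is None else min(expires, ex)
    return {
        "last_modified": last_modified,
        "etag": '"' + ";".join(etag_parts) + '"',
        "expires": expires,
    }
```

The merged Last-Modified is the newest of the parts and the merged Expires is
the earliest, so the combined response is never presented as fresher than its
least cacheable component.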

Recursive requests become challenging.  Needs more thought.

BTW the same gear could be used to do the mod_cgi Content-Length
generation.

This solution feels 2.0ish.

Dean

On Tue, 22 Jul 1997, Lars Eilebrecht wrote:

> According to Dean Gaudet:
> 
> > Ok, here is something that I think I would be happy with:  use boyer-moore
> > and mmap() in mod_include to speed it up.  Then use a quick and dirty
> > two-pass heuristic to calculate Last-Modified (and ETag)
> 
> and ideally "Content-MD5" if ContentDigest is enabled.
> 
> > provided that all of the directives are truly static.  The first pass aborts
> > as soon as it encounters something non-static. 
> 
> If there's an <!--#include virtual="foobar.sh" --> somewhere in the document
> you may need to check if it's really static (e.g. included as-is) or if it is
> maybe a CGI script.
> This was a problem I didn't solve (was too lazy to solve ;-) when I hacked my
> old NCSA server to output Last-Modified headers with SSIs. I'm now using a
> derived version (written by a friend) on my Apache servers (the infamous
> SSILMHACK, there is a PR for it I think).
>  
> > The end result will probably be about the same performance as the existing
> > mod_include.
> 
> Sounds great.
> 
> [...]
> > The directive "IncludesTwoPassThresh NNN" would indicate that two-pass
> > should be aborted whenever NNN bytes have been read from the inputs...
> > which lets it be disabled, and prevents it from being a problem on large
> > inputs.
> 
> Hmmm... do we really need this? Imagine the following: if a resource
> (a big one) is rarely accessed, it doesn't hurt Apache if it has to
> parse it twice. If the resource is frequently accessed, it may often be
> cached in a proxy cache, resulting in fewer hits on the server.
> But if the pass is aborted due to "IncludesTwoPassThresh" it cannot be
> cached. This may result in a higher load on the server (depending on
> how big NNN is).
> Maybe a per-directory directive that completely disables the two-pass
> variant is more useful (e.g. "DisableTwoPassIncludes").
> 
> Just some esoteric thoughts... :-)
> 
> ciao... 
> -- 
> Lars Eilebrecht
> sfx@unix-ag.org
> 


RE: mod_include heuristic

Posted by Lars Eilebrecht <La...@unix-ag.org>.
According to Dean Gaudet:

> Ok, here is something that I think I would be happy with:  use boyer-moore
> and mmap() in mod_include to speed it up.  Then use a quick and dirty
> two-pass heuristic to calculate Last-Modified (and ETag)

and ideally "Content-MD5" if ContentDigest is enabled.

> provided that all of the directives are truly static.  The first pass aborts
> as soon as it encounters something non-static. 

If there's an <!--#include virtual="foobar.sh" --> somewhere in the document
you may need to check if it's really static (e.g. included as-is) or if it is
maybe a CGI script.
This was a problem I didn't solve (was too lazy to solve ;-) when I hacked my
old NCSA server to output Last-Modified headers with SSIs. I'm now using a
derived version (written by a friend) on my Apache servers (the infamous
SSILMHACK, there is a PR for it I think).
 
> The end result will probably be about the same performance as the existing
> mod_include.

Sounds great.

[...]
> The directive "IncludesTwoPassThresh NNN" would indicate that two-pass
> should be aborted whenever NNN bytes have been read from the inputs...
> which lets it be disabled, and prevents it from being a problem on large
> inputs.

Hmmm... do we really need this? Imagine the following: if a resource
(a big one) is rarely accessed, it doesn't hurt Apache if it has to
parse it twice. If the resource is frequently accessed, it may often be
cached in a proxy cache, resulting in fewer hits on the server.
But if the pass is aborted due to "IncludesTwoPassThresh" it cannot be
cached. This may result in a higher load on the server (depending on
how big NNN is).
Maybe a per-directory directive that completely disables the two-pass
variant is more useful (e.g. "DisableTwoPassIncludes").

Just some esoteric thoughts... :-)

ciao... 
-- 
Lars Eilebrecht
sfx@unix-ag.org