Posted to dev@httpd.apache.org by Simon Spero <se...@tipper.oit.unc.edu> on 1998/09/23 20:56:17 UTC

Re: Core server caching

[This is going to be relatively short, since my hands are messed up and my
decent voice recognition machine is in for service. Also, this voice
recognition software isn't set up to handle code.  I'll see if I can do a
brain transplant into my boss's laptop later tonight.  This is all way too
fuzzy without a decent example.]


 ------------
Cache invalidation can get pretty hairy in the most complicated cases - for
example, when dealing with a module implementing a generic scripting
language; for native modules, however, it can be made relatively simple.
If, instead of treating dynamic objects which depend on a number of
parameters as just glorified implementations of a generic GET or POST
method, they are implemented and declared as application-oriented
methods which get called by the GET or POST handler, it becomes a lot
easier to annotate those methods with cache validation information.  If
we're trying to lose the assumption that we're always talking HTTP back,
then this makes the design a lot cleaner.
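
A rough sketch of what such an annotation might look like (hypothetical
names and types, not a proposal for the actual API): each application
method carries a validator callback that the cache can consult before
serving a stored result.

    /* Hypothetical sketch only -- not the Apache API.  An application
     * method registers a validator alongside its handler, so the cache
     * can ask "is this entry still good?" without re-running the method. */
    #include <time.h>

    typedef struct app_method {
        const char *name;                             /* e.g. "get_stock_quote" */
        int (*handler)(void *req, void *out);         /* generates the response */
        int (*is_fresh)(void *req, time_t cached_at); /* 1 = cached copy valid  */
        time_t max_age;                               /* coarse fallback TTL    */
    } app_method;

    /* The cache layer consults the annotation before serving a hit. */
    static int cache_may_serve(const app_method *m, void *req, time_t cached_at)
    {
        if (m->is_fresh)
            return m->is_fresh(req, cached_at);
        return (time(NULL) - cached_at) <= m->max_age;
    }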
------
Some I/O modules just add or subtract headers; others completely
rewrite the contents.  The former can be implemented really efficiently,
especially if high-level modules can pre-inform lower-level modules of how
much extra space they will need for headers.
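
A minimal sketch of the pre-informing idea, assuming a hypothetical
buffer type that reserves headroom up front so a lower layer can prepend
its headers without copying the body:

    #include <stdlib.h>
    #include <string.h>

    typedef struct hbuf {
        char  *base;    /* start of allocation            */
        size_t head;    /* offset of first used byte      */
        size_t len;     /* bytes of payload currently set */
        size_t cap;     /* total allocation size          */
    } hbuf;

    static hbuf *hbuf_make(size_t headroom, const char *body, size_t blen)
    {
        hbuf *b = malloc(sizeof(*b));
        b->cap  = headroom + blen;
        b->base = malloc(b->cap);
        b->head = headroom;                 /* leave space for headers */
        b->len  = blen;
        memcpy(b->base + b->head, body, blen);
        return b;
    }

    /* A lower layer prepends into the reserved space; no body copy needed. */
    static int hbuf_prepend(hbuf *b, const char *hdr, size_t hlen)
    {
        if (hlen > b->head)
            return -1;                      /* not enough reserved space */
        b->head -= hlen;
        b->len  += hlen;
        memcpy(b->base + b->head, hdr, hlen);
        return 0;
    }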
------
One distinction that is useful to keep in mind is the difference between
streams that are connected to a network, and those that are used to transfer
data between parts of the server.  If I'm allowed to keep going with my
three-level model, the former connect the front-end to the middle-end,
and the latter connect the middle-end to the back-end; one end drains
into the cache, and the other end drains out.
------
If all streams are connected via the middle-end, then intermediate stream
content can potentially be cached.
------
The job of the middle-end is to mediate such disagreements, and to apply
architecture-specific optimisations as much as possible - in particular
TransmitFile and its ilk, zero-copy stacks, etc.
-------
Depending on what sort of back-end module you use, there are several
different ways of feeding data into the middle-end that make sense.  For
modules that source their data from a file, the best way to pass the data
is by filename.  For other modules, having a file handle makes more sense.
Yet others are best suited to a mechanism using mbufs/sbufs - especially
those that are performing transformations on the data.  The internal object
representing this data source can emulate all of these interfaces as long
as one is implemented.  If the data that reaches the cache was originally
sourced from a filename, then the cache should only be keeping the
information about the mapping, and not the actual contents of the file.
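
As an illustration only (hypothetical types, not server code), a source
object that natively knows its filename can emulate the file-descriptor
and memory-buffer interfaces on demand:

    #include <stddef.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    typedef struct data_source {
        const char *filename;   /* set if the data lives in a file */
        int         fd;         /* -1 until opened                 */
        void       *mem;        /* NULL until mapped or generated  */
        size_t      len;
    } data_source;

    /* Emulate the fd interface on top of the filename interface. */
    static int source_as_fd(data_source *s)
    {
        if (s->fd < 0 && s->filename)
            s->fd = open(s->filename, O_RDONLY);
        return s->fd;
    }

    /* Emulate the buffer interface via mmap of the underlying fd. */
    static void *source_as_mem(data_source *s)
    {
        if (!s->mem) {
            struct stat st;
            int fd = source_as_fd(s);
            if (fd >= 0 && fstat(fd, &st) == 0) {
                s->len = (size_t)st.st_size;
                s->mem = mmap(NULL, s->len, PROT_READ, MAP_SHARED, fd, 0);
                if (s->mem == MAP_FAILED)
                    s->mem = NULL;
            }
        }
        return s->mem;
    }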
------
Some types of back-end object, for example those based on files in an
environment supporting mmap, may prefer to allocate their own memory;
similarly, some types of front-end may work better if they allocate the
memory for data to be stored in.  An example of this might be a system
with a memory-mapped network interface and a collaborative TCP
implementation.
----------





Re: Core server caching

Posted by Dean Gaudet <dg...@arctic.org>.

On Thu, 29 Oct 1998, Honza Pazdziora wrote:

> > 
> > I'm sure there are other applications... but none that convince me that we
> > want to spend a lot of time inventing a layered i/o model which has the
> > huge potential to slow down the server.  Look at STREAMS for an example of
> 
> Well, it would slow the server down if it were used; if you just
> had one module to fetch the file and send it to the browser, no slowdown
> would show up.

... and now fit TransmitFile or sendfile() into the solution ... 

... and fit mmap()/write() which is faster than read()/write() in many
cases ... 

If the server could only be as fast as read()/write() allows it to be then
yeah you're right.  But that's not the end of things -- we could possibly
have pages mapped into a memory cache and we want those to go through the
stack with as few copies as possible.  A naive stack copies them at every
layer.
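
For illustration of the copy counts involved (not Apache code; the
sendfile() signature shown is the Linux variant, and other platforms
differ):

    #include <sys/sendfile.h>
    #include <unistd.h>

    static int copy_with_read_write(int in_fd, int out_fd)
    {
        char buf[8192];
        ssize_t n;
        while ((n = read(in_fd, buf, sizeof(buf))) > 0)   /* copy #1: kernel -> user */
            if (write(out_fd, buf, n) != n)               /* copy #2: user -> kernel */
                return -1;
        return n < 0 ? -1 : 0;
    }

    static int copy_with_sendfile(int in_fd, int out_fd, size_t len)
    {
        off_t off = 0;
        while (len > 0) {
            ssize_t n = sendfile(out_fd, in_fd, &off, len); /* no user-space copy */
            if (n <= 0)
                return -1;
            len -= (size_t)n;
        }
        return 0;
    }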

Dean


Re: Core server caching

Posted by Honza Pazdziora <ad...@informatics.muni.cz>.
> 
> I'm sure there are other applications... but none that convince me that we
> want to spend a lot of time inventing a layered i/o model which has the
> huge potential to slow down the server.  Look at STREAMS for an example of

Well, it would slow the server down if it were used; if you just
had one module to fetch the file and send it to the browser, no slowdown
would show up.

Let me show an application that's been on my mind for quite a long time:

Page:
	Top:			Middle:				Footer:

			   out to browser
				|
			      gzip
	+-----------------------+-------------------------------+
 recode from utf to ___		|				plain file
	|		recode from some other charset
 fetch from Oracle db		|
			    XML -> HTML
				|
			some Perl (mod_perl) or CGI

We are able to get near this model in mod_perl with Apache::Mason
and OutputChain.  But it's not very nice, because it's hard to separate
the global state from the local state at the interface between the modules.
Also, sending the headers down the tree (during initialization) and
back during output processing has some bugs.

There are some proposals on the mod_perl mailing list for some new
features, and I'd like to implement them, but I do not want to diverge
from how 2.0 will look.

Is there going to be more than just a stack of filters, like the tree
above?
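
For what it's worth, a tree rather than a plain stack could be expressed
with something like the following hypothetical node type (a sketch only,
not a proposal for the actual 2.0 API):

    /* A filter whose input can come from several upstream nodes,
     * i.e. a tree rather than a simple linear stack. */
    typedef struct filter_node {
        const char *name;                 /* e.g. "gzip", "recode", "xml2html" */
        int (*process)(struct filter_node *self,
                       const char *in, long inlen,
                       char **out, long *outlen);
        struct filter_node **inputs;      /* upstream children (NULL-terminated) */
        struct filter_node *output;       /* single downstream consumer */
    } filter_node;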

------------------------------------------------------------------------
 Honza Pazdziora | adelton@fi.muni.cz | http://www.fi.muni.cz/~adelton/
                   I can take or leave it if I please
------------------------------------------------------------------------

Re: Core server caching

Posted by Honza Pazdziora <ad...@informatics.muni.cz>.
> 
> One recent feature that was added to PHP was an option to gzip the output
> stream of the module.  It works fine, but architecturally I don't think
> PHP should be responsible for something like that.  Others have mentioned

Yes. Forcing any module to support gzipping is bad, IMHO. And even if
we only had the gzip filtering module, I think it's worth implementing
it cleanly.

As for the SSI (or any other "including" module), it should also be
done as a filter -- CGI could do it itself, yes, but as the number of
different sources that should be considered (plain file, script,
Apache::Registry precompiled script, database record) gets higher, so
grows the number of modules that would have to implement SSI in such
a way.  Isn't it better to do it once and define a single,
as-few-copies-as-possible solution?

And how about converting PNG to GIF (and caching the result)?  Again, the
picture can come either from a file, a Perl script with GD, or a database.

This would not force anyone to use it; anyone could still be doing it
in his own script, of course.

------------------------------------------------------------------------
 Honza Pazdziora | adelton@fi.muni.cz | http://www.fi.muni.cz/~adelton/
                   I can take or leave it if I please
------------------------------------------------------------------------

Re: Core server caching

Posted by Ben Hyde <bh...@pobox.com>.
Dean Gaudet writes:
>On Thu, 29 Oct 1998, Rasmus Lerdorf wrote:
>
>> There are also weird and wacky things you would be able to do if you could
>> stack mod_php on top of mod_perl.
>
>You people scare me.
>
>Isn't that redundant though?
>
>Dean

Yes it's scary, but oddly erotic, when these behemoths with their
gigantic interpreters try to mate.

It's an interesting syndrome: as soon as systems get an interpreter
they tend to lose their bearings and grow into vast behemoths that
lumber about, slowly crushing little problems with their vast mass.
Turing syndrome?

I've heard people say modules can help avoid this, but I've rarely
seen it.  Olde Unix kinda manages it; I remember being frightened by
awk.

Can we nudge alloc.c/buff.c toward a bit of connective glue that
continues to let individual modules evolve their own gigantism while
avoiding vile effects on the core performance of the server?  Stuff
like this:

  memory chunk alignment for optimal I/O
  memory hand off along the pipeline
  memory hand off crossing pool boundaries
  memory hand off in zero copy cases
  transmit file
  transmit cache elements
  insert/remove cache elements
  leverage unique hardware and instructions

That memcpy in ap_bread really bugs me.

I'd rather have routines that let me hand off chunks.  Presumably
these would need to be able to move chunks across pool and buffer
boundaries.  But zero copy if I don't touch the content, and never a
memcpy just to let me lex the input.
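
Something like the following hypothetical chunk type would be enough to
express the handoff (a sketch, not a buff.c patch): ownership moves by
pointer, the bytes stay put.

    #include <stddef.h>

    typedef struct chunk {
        const char *data;
        size_t      len;
        void      (*cleanup)(struct chunk *);   /* how to release the storage  */
        void       *owner;                      /* pool currently responsible  */
        struct chunk *next;
    } chunk;

    /* Move a chunk onto another list/pool: pointers change hands, bytes don't. */
    static void chunk_handoff(chunk *c, void *new_owner, chunk **dest_list)
    {
        c->owner = new_owner;
        c->next  = *dest_list;
        *dest_list = c;
    }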

I've built systems like this with the buffers exposing an emacs-buffer
style of abstraction, but with special kinds of marks
to denote what's been released for sending, and what's been accepted
and lex'd on the input side.  It does mean all your
lexical and printf stuff has to be able to smoothly slide
over chunk boundaries.

 - ben

Re: Core server caching

Posted by Rasmus Lerdorf <ra...@lerdorf.on.ca>.
> > There are also weird and wacky things you would be able to do if you could
> > stack mod_php on top of mod_perl.
> 
> You people scare me.
> 
> Isn't that redundant though?

Well yeah, to some extent I suppose it is similar to the mod_cgi vs.
mod_include relationship.  But you can do some funky things with mod_perl
that you can't do with PHP.  You have to know what you are doing, though.
I could see an ISP doing some low-level mod_perl trickery in their server
and having mod_php be the embedded scripting language they expose to the
Joe-average user.

And heck, I want to do something useful with the Notes table!  ;)

-Rasmus


Re: Core server caching

Posted by Dean Gaudet <dg...@arctic.org>.

On Thu, 29 Oct 1998, Rasmus Lerdorf wrote:

> There are also weird and wacky things you would be able to do if you could
> stack mod_php on top of mod_perl.

You people scare me.

Isn't that redundant though?

Dean



Re: Core server caching

Posted by Rasmus Lerdorf <ra...@lerdorf.on.ca>.
> On Thu, 29 Oct 1998, Rodent of Unusual Size wrote:
> 
> > I've heard this 'CGI can do anything directly that an SSI can do'
> > waffle before.  While true, it's reinventing the wheel in the
> > CGI.
> 
> Which is why I suggested php. 
> 
> > So Dean may find it boring, but that doesn't necessarily make
> > it so as an absolute.
> 
> I don't find it a motivating enough application -- I want other, more
> motivating "killer" applications, if you will.  It's always worthwhile
> looking at a desired feature that's going to take a lot of code to add
> (filters, layered i/o) and asking "ok, what will it really be used for?"

One recent feature that was added to PHP was an option to gzip the output
stream of the module.  It works fine, but architecturally I don't think
PHP should be responsible for something like that.  Others have mentioned
it, but a compression layer sounds like a pretty good example of
something that layered i/o is perfect for.

Another thing that was added to PHP is an XML parser (expat).  This could
potentially be a separate layer as well.  

There are also weird and wacky things you would be able to do if you could
stack mod_php on top of mod_perl.

-Rasmus


Re: Core server caching

Posted by Dean Gaudet <dg...@arctic.org>.

On Thu, 29 Oct 1998, Rodent of Unusual Size wrote:

> I've heard this 'CGI can do anything directly that an SSI can do'
> waffle before.  While true, it's reinventing the wheel in the
> CGI.

Which is why I suggested php. 

> So Dean may find it boring, but that doesn't necessarily make
> it so as an absolute.

I don't find it a motivating enough application -- I want other, more
motivating "killer" applications, if you will.  It's always worthwhile
looking at a desired feature that's going to take a lot of code to add
(filters, layered i/o) and asking "ok, what will it really be used for?"

Dean



Re: Core server caching

Posted by Rodent of Unusual Size <Ke...@Golux.Com>.
Dean Gaudet wrote:
> 
> Dean wants examples of useful applications of this which don't amount to
> complete kludges... CGI -> SSI is a total kludge, you can do everything
> the SSI can do from the CGI itself.  Or you could use a real language like
> perl or php and get rid of the SSI that way.  So CGI->SSI is a really
> boring application.

The question was about Dean's reasoning, and he's answered it
fairly.  I just want to go on record with a differing opinion.

I've heard this 'CGI can do anything directly that an SSI can do'
waffle before.  While true, it's reinventing the wheel in the
CGI.  Consider a CGI script that picks and displays a set of
files based on the POTM (or whatever) after setting some
environment variables.  It's perfectly reasonable to want those
files to be parsed in order to maintain consistency and use
the envariables the script set up.  Otherwise the script would
have to do the parsing itself.  If scripts and parsed files
are supposed to have some sort of common format for their output,
a simple #include handles all cases for the files, but
each script needs to be manually modified to keep the layout in sync.

So Dean may find it boring, but that doesn't necessarily make
it so as an absolute.  I among others would like to see this
capability, and don't consider it a kludge at all.  Not the
optimal solution, possibly, but not a kludge either. :-)

#ken	P-)}

Ken Coar                    <http://Web.Golux.Com/coar/>
Apache Group member         <http://www.apache.org/>
"Apache Server for Dummies" <http://Web.Golux.Com/coar/ASFD/>

Re: Core server caching

Posted by Dean Gaudet <dg...@arctic.org>.

On Thu, 29 Oct 1998, Ed Korthof wrote:

> How about gzip'ing content on the fly, or translating it to another
> charset?

Yeah these both came up during the hike.  They're much better example
applications in my opinion.

> The ones which come to mind are HTTP-NG and other HTTP extensions -- it
> would be easy to make Apache provide these services using layered I/O.
> W/o layered I/O, it means more hacks in the core.  Those are things I'd
> love to see Apache support...

You mean MUX?  All we need for MUX is a single layer, the bottom layer
read/write/accept. 

> OTOH, simply providing for that would require much less than some of the
> proposals which we've discussed.

Yeah :)

Dean


Re: Core server caching

Posted by Ed Korthof <ed...@bitmechanic.com>.
On Thu, 29 Oct 1998, Dean Gaudet wrote:

> On Thu, 29 Oct 1998, Honza Pazdziora wrote:
> 
> > I've also found the following sentence in the docs:
> > 	The final goal of all of this, of course, is simply to allow CGI
> > 	output to be parsed for server-side includes. But don't tell Dean that.
> > 
> > What does make Dean so sad about the idea? ;-)
> 
> Dean wants examples of useful applications of this which don't amount to
> complete kludges... CGI -> SSI is a total kludge, you can do everything
> the SSI can do from the CGI itself.  Or you could use a real language like
> perl or php and get rid of the SSI that way.  So CGI->SSI is a really
> boring application. 

How about gzip'ing content on the fly, or translating it to another
charset?

> Crypto is an interesting application, but there are these stupid export
> laws which mean we can't really say "this is *the* reason we did this". 

The ones which come to mind are HTTP-NG and other HTTP extensions -- it
would be easy to make Apache provide these services using layered I/O.
W/o layered I/O, it means more hacks in the core.  Those are things I'd
love to see Apache support...

OTOH, simply providing for that would require much less than some of the
proposals which we've discussed.

> I'm sure there are other applications... but none that convince me that we
> want to spend a lot of time inventing a layered i/o model which has the
> huge potential to slow down the server.  Look at STREAMS for an example of
> a protocol stack which is a complete and utter performance failure... we
> shouldn't repeat that.  Every vendor who had TCP/IP over STREAMS has
> replaced it with various fast path hacks to divert things away from
> STREAMS and into a custom TCP/IP stack as early as possible.

Efficiency is a real concern, but w/ the last proposal based on the stuff
Alexei and I worked out, I think the only questionable area was w/ reading
data; and as you said, using reference counts provides a reasonable way to
handle this.  Reference counts kinda suck in C, but given the pool model
which we have, I think it can be done relatively cleanly.  I'll post a
specific proposal if you'd like.
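
For concreteness, a minimal sketch of the reference-count idea
(hypothetical types; a real proposal would tie the cleanup into pools):

    #include <stdlib.h>

    typedef struct shared_buf {
        char  *data;
        size_t len;
        int    refs;
    } shared_buf;

    /* Another layer takes a reference instead of copying the bytes. */
    static shared_buf *sbuf_retain(shared_buf *b)  { b->refs++; return b; }

    /* Storage is freed only when the last holder lets go. */
    static void sbuf_release(shared_buf *b)
    {
        if (--b->refs == 0) {
            free(b->data);
            free(b);
        }
    }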

I think it should be possible to do this (in a less than completely
general way) w/o a heavy impact on efficiency... but only if we give up
some potential features (IMO, of course).

Ed



Re: Core server caching

Posted by Dean Gaudet <dg...@arctic.org>.

On Thu, 29 Oct 1998, Honza Pazdziora wrote:

> I've also found the following sentence in the docs:
> 	The final goal of all of this, of course, is simply to allow CGI
> 	output to be parsed for server-side includes. But don't tell Dean that.
> 
> What does make Dean so sad about the idea? ;-)

Dean wants examples of useful applications of this which don't amount to
complete kludges... CGI -> SSI is a total kludge, you can do everything
the SSI can do from the CGI itself.  Or you could use a real language like
perl or php and get rid of the SSI that way.  So CGI->SSI is a really
boring application. 

Crypto is an interesting application, but there are these stupid export
laws which mean we can't really say "this is *the* reason we did this". 

I'm sure there are other applications... but none that convince me that we
want to spend a lot of time inventing a layered i/o model which has the
huge potential to slow down the server.  Look at STREAMS for an example of
a protocol stack which is a complete and utter performance failure... we
shouldn't repeat that.  Every vendor who had TCP/IP over STREAMS has
replaced it with various fast path hacks to divert things away from
STREAMS and into a custom TCP/IP stack as early as possible.

Dean



Re: Core server caching

Posted by Honza Pazdziora <ad...@informatics.muni.cz>.
> 
> >Some I/O modules just add or subtract headers; others completely
> >rewrite the contents.  The former can be implemented really efficiently,
> >especially if high-level modules can pre-inform lower-level modules of how
> >much extra space they will need for headers.
> 
> If things pushed late into the I/O stack are going to add headers, you
> are then stuck blocking I/O for the entire response until they
> unblock it.  Presumably the initial choice of response generator can
> select whether blocking is or isn't required.

If you have a CGI script and some module behaves differently depending on
the headers it gets from the script, this may influence the
blocking as well, can't it?

Another idea: do you think it is reasonable to consider that the content
passed between the modules will be in some other format than just raw
bytes?  I mean parsed HTML/XML, for example.  If we have a CGI script and
want to process its output with an HTML parser, it might be good to
know that some module that will be taking the output from the first
filter will want it as HTML as well, so it'll be faster to pass the
structure and avoid the second parse.
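
A sketch of what such a typed payload might look like (purely
hypothetical; nothing like this exists in the server today):

    /* A filter advertises whether it wants raw bytes or an already-parsed
     * tree, so two adjacent filters that both speak "parsed HTML" can
     * skip the second parse. */
    typedef enum {
        CONTENT_RAW_BYTES,
        CONTENT_PARSED_HTML,
        CONTENT_PARSED_XML
    } content_kind;

    typedef struct typed_content {
        content_kind kind;
        union {
            struct { const char *bytes; long len; } raw;
            void *tree;                  /* parser-specific document tree */
        } u;
    } typed_content;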

I've also found the following sentence in the docs:
	The final goal of all of this, of course, is simply to allow CGI
	output to be parsed for server-side includes. But don't tell Dean that.

What does make Dean so sad about the idea? ;-)

------------------------------------------------------------------------
 Honza Pazdziora | adelton@fi.muni.cz | http://www.fi.muni.cz/~adelton/
                   I can take or leave it if I please
------------------------------------------------------------------------

Re: Core server caching

Posted by Ben Hyde <bh...@pobox.com>.
Simon Spero writes:
>[This is going to be relatively short...]
My empathy on the hands, I assure you.

>Cache invalidation can get pretty hairy in the most complicated cases - for
>example, when dealing with a module implementing a generic scripting
>language; for native modules, however, it can be made relatively simple.
>If, instead of treating dynamic objects which depend on a number of
>parameters as just glorified implementations of a generic GET or POST
>method, they are implemented and declared as application-oriented
>methods which get called by the GET or POST handler, it becomes a lot
>easier to annotate those methods with cache validation information.  If
>we're trying to lose the assumption that we're always talking HTTP back,
>then this makes the design a lot cleaner.

Yes.  The range of useful cache invalidation heuristics makes it difficult
to design a one-size-fits-all invalidation scheme in the core.  While
a callback to validate is a good idea, I think it can be put back a
little in the I/O stack from the base entry in that stack, but I do
think the base of the stack needs to have a cache of things it can
send quickly.

>Some I/O modules just add or subtract headers; others completely
>rewrite the contents.  The former can be implemented really efficiently,
>especially if high-level modules can pre-inform lower-level modules of how
>much extra space they will need for headers.

If things pushed late into the I/O stack are going to add headers, you
are then stuck blocking I/O for the entire response until they
unblock it.  Presumably the initial choice of response generator can
select whether blocking is or isn't required.

>One distinction that is useful to keep in mind is the difference between
>streams that are connected to a network, and those that are used to transfer
>data between parts of the server.  If I'm allowed to keep going with my
>three-level model, the former connect the front-end to the middle-end,
>and the latter connect the middle-end to the back-end; one end drains
>into the cache, and the other end drains out.

I'm not sure my model of "levels" and yours are in synch.  But yes.  As
soon as there is an I/O pipeline (or stack), the protocol between
elements of the stack is, presumably, something different from HTTP.


>If all streams are connected via the middle-end, then intermediate stream
>content can potentially be cached.

Yes.

My model of what flows on the pipeline is response_chunks of various
flavors, and furthermore that it is useful to have send_cached_chunk
as one of those flavors at the bottommost stage in the pipeline.
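
A minimal sketch of that model (hypothetical names): each chunk declares
its flavor, and the bottom stage dispatches on it.

    typedef enum {
        CHUNK_DATA,            /* plain bytes                       */
        CHUNK_FILE,            /* transmit-file style reference     */
        CHUNK_SEND_CACHED      /* replay a previously cached entry  */
    } chunk_flavor;

    typedef struct response_chunk {
        chunk_flavor flavor;
        const void  *payload;  /* bytes, file reference, or cache id */
        long         length;
    } response_chunk;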

>The job of the middle-end is to mediate such disagreements, and to apply
>architecture-specific optimisations as much as possible - in particular
>TransmitFile and its ilk, zero-copy stacks, etc.

I see that as the problem of designing a good I/O pipeline/stack,
i.e. one that can rapidly pass the high-performance cases at
minimum overhead.  How many layers get stacked up varies both
from application to application and during the generation
of a single response.

>Depending on what sort of back-end module you use, there are several
>different ways of feeding data into the middle-end that make sense.  For
>modules that source their data from a file, the best way to pass the data
>is by filename.  For other modules, having a file handle makes more sense.
>Yet others are best suited to a mechanism using mbufs/sbufs - especially
>those that are performing transformations on the data.  The internal object
>representing this data source can emulate all of these interfaces as long
>as one is implemented.  If the data that reaches the cache was originally
>sourced from a filename, then the cache should only be keeping the
>information about the mapping, and not the actual contents of the file.

Interesting.  My model was that the content generator (input of the
pipeline) would delimit portions of his output as being "cachable",
getting a cache id in return which he would reuse later.  Meanwhile,
at the tail end of the pipeline, it would record the stuff between
the brackets for replay later when requested.  It could of course
decide what to record, either a sequence of chunks or the
concatenation of those chunks into something it can ship out even
faster.  If it records a transmit-file chunk, and not the content
of that file, then the guy holding the cache id must know that.
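
A sketch of the bracketing interface implied above (hypothetical names
only, a sketch rather than a design):

    /* The generator opens a bracket, gets an id back, and later asks the
     * tail of the pipeline to replay whatever it recorded under that id. */
    typedef long cache_id;

    typedef struct pipeline pipeline;                /* opaque here */

    cache_id cache_begin(pipeline *p);               /* start recording a cachable region */
    void     cache_end(pipeline *p, cache_id id);    /* stop recording                    */
    int      cache_replay(pipeline *p, cache_id id); /* resend the recorded content       */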

>Some types of back-end object, for example those based on files in an
>environment supporting mmap, may prefer to allocate their own memory;
>similarly, some types of front-end may work better if they allocate the
>memory for data to be stored in.  An example of this might be a system
>with a memory-mapped network interface and a collaborative TCP
>implementation.

That's an example of why I want response chunks that flow down
the pipeline to be extensible with an atomic set at the base.

The pipeline design is a lot like the module API design problem.
It would be wonderful if it all works well enough that the
"module" can reside at assorted distances from the server
(same thread, same process, different process, different machine).

It seems to me useful to admit that the protocol/api/pipeline
within the server site can be and ought to be richer than HTTP is.

 - ben hyde