You are viewing a plain text version of this content. The canonical link for it is here.
Posted to cvs@httpd.apache.org by ak...@hyperreal.org on 1998/07/23 18:42:05 UTC
cvs commit: apache-2.0/docs overview_aek

akosut      98/07/23 09:42:05

  Added:       docs     overview_aek
  Log:
  Add my 2.0 overview document to CVS, where it will be maintained for
  all eternity. Please make changes where appropriate.
  
  Revision  Changes    Path
  1.1                  apache-2.0/docs/overview_aek
  
  Index: overview_aek
  ===================================================================
  
  From akosut@leland.Stanford.EDU Thu Jul 23 09:38:40 1998
  Date: Sun, 19 Jul 1998 00:12:37 -0700 (PDT)
  From: Alexei Kosut <ak...@leland.Stanford.EDU>
  To: new-httpd@apache.org
  Subject: Apache 2.0 - an overview
  
  For those not at the Apache meeting in SF, and even for those who were,
  here's a quick overview of (my understanding of) the Apache 2.0
  architecture that we came up with. I present this to make sure that I have
  it right, and to get opinions from the rest of the group. Enjoy.
  
  
  1. "Well, if we haven't released 2.0 by Christmas of 1999, it won't
      matter anyway." 
  
  A couple of notes about this plan: I'm looking at this right now from a
  design standpoint, not an implementation one. If the plan herein were
  actually coded as-is, you'd get a very inefficient web server. But as
  Donald Knuth (Professor emeritus at Stanford, btw... :) points out,
  "premature optimization is the root of all evil." Rest assured there are
  plenty of ways to make sure Apache 2.0 is much faster than Apache 1.3.
  Taking out all the "slowness" code, for example... :)
  
  Also, the main ideas in this document mainly come from Dean Gaudet, Simon
  Spero, Cliff Skolnick and a bunch of other people, from the Apache Group's
  meeting in San Francisco, July 2 and 3, 1998. The other ideas come from
  other people. I'm being vague because I can't quite remember. We should
  have videotaped it.  I've titled the sections of this document with quotes
  from our meeting, but they are paraphrased from memory, so don't take them
  too seriously.
  
  2. "But Simon, how can you have a *middle* end?"
  
  One of the main goals of Apache 2.0 is protocol independence (i.e.,
  serving HTTP/1.1, HTTP-NG, and maybe FTP or gopher or something). Another
  is to rid the server of the belief that everything is a file. Towards this
  end, we divide the server up into three parts, the front end, the middle
  end, and the back end.
  
  The front end is essentially a combination of http_main and http_protocol
  today. It takes care of all network and protocol matters, interpreting the
  request, putting it into a protocol-neutral form, and (possibly) passing
  it off to the rest of the server. This is approximately equivalent to the
  part of Apache contained in Dean's flow stuff, and it also works very well
  in certain non-Unix-like architectures such as clustered mainframes. In
  addition, part of this front-end might be optionally run in kernel space,
  giving a very fast server indeed...
  
  The back end is what generates the content. At the back of the back end we
  have backing stores (Cliff's term), which contain actual data. These might
  represent files on a disk, entries in a database, CGI scripts, etc... The
  back end also consists of other modules, which can alter the request in
  various fashions. The objects the server acts on can be thought of (Cliff
  again) as a filehandle and a set of key/value pairs (metainformation).
  The modules are set up as filters that can alter either one of those,
  stacking I/O routines onto the stream of data, or altering the
  metainformation.
  
  The middle end is what comes between the front and back ends. Think of
  http_request. This section takes care of arranging the modules, backing
  stores, etc... into a manner so that the path of the request will result
  in the correct entity being delivered to the front end and sent to the
  client.
  
  3. "I won't embarrass you guys with the numbers for how well Apache
      performs compared to IIS." (on NT)
  
  For a server that was designed to handle flat files, Apache does it
  surprisingly poorly, compared with other servers that have been optimized
  for it. And the performance for non-static files is, of course, worse.
  While Apache is still more than fast enough for 95% of Web servers, we'd
  be remiss to dismiss those other 5% (they're the fun ones anyway). Another
  problem Apache has is its lack of a good, caching, proxy module.
  
  Put these together, along with the work Dean has done with the flow and
  mod_mmap_static stuff, and we realize the most important part of Apache
  2.0: a built-in, all-pervasive, cache. Every part of the request process
  will involve caching. In the path outlined above, between each layer of
  the request, between each module, sits the cache, which can (when it is
  useful), cache the response and its metainformation - including its
  variance, so it knows when it is safe to give out the cached copy. This
  gives every opportunity to increase the speed of the server by making sure
  it never has to dynamically create content more than it needs to, and
  renders accelerators such as Squid unnecessary.
  
  This also allows what I alluded to earlier: a kernel (or near-to-kernel)
  based web server component, which could read the request, consult the
  cache to find the requested object, and spit it back out, without so much
  as an interrupt in the way. Of course, the rest of Apache (with all its
  modules - it's generally a bad idea to let unknown, untrusted code, insert
  itself into the kernel) sits up in user-space, ready to handle any request
  the micro-Apache can't.
  
  A built-in cache also makes a real working HTTP/1.1 proxy server trivially
  easy to write.
  
  4. "Stop asking about backwards compatibility with the API. We'll write a
      compatibility module... later." 
  
  If modules are as described above, then obviously they are very much
  distinct from how Apache's current modules function. The only module
  function that is similar to the current model is the handler, or backing
  store, that actually provides the basic stream of data that the server
  alters to product a response entity.
  
  The basic module's approach to its job is to stack a filter onto the
  output. But it's better to think of the modules not as a stack that the
  request flows through (a layer cake with cache icing between the layers),
  but more of a mosaic (pretend I didn't use that word. I wrote collage. You
  can't prove anything), with modules stuck onto various sides of the
  request at different points, altering the request/response.
  
  Today's Apache modules take an all-or-nothing approach to request
  handlers. They tell Apache what they can do, overestimating, and then are
  supposed to DECLINE if they don't pass a number of checks they are
  supposed to make. Most modules don't do this correctly. The better
  approach is to allow the modules to inform Apache exactly of what they can
  do, and have Apache (the middle-end) take care of invoking them when
  appropriate.
  
  The final goal of all of this, of course, is simply to allow CGI output to
  be parsed for server-side includes. But don't tell Dean that.
  
  5. "Will Apache run without any of the normal Unix binaries installed,
      only the BSD/POSIX libraries?"
  
  Another major issue is, of course, configuration of the server. There are
  a number of distinct opinions on this, both as to what should be
  configured and how it should be done. We talked mainly about the latter,
  but the did touch on the former. Obviously, with a radically distinct
  module API, the configuration is radically different. We need a good way
  to specify how the modules are supposed to interact, and of controlling
  what they can do, when and how, balancing what the user asks the server to
  do, and what the module (author) wants the server to do. We didn't really
  come up with a good answer to this.
  
  However, we did make some progress on the other side of the issue: We
  agreed that the current configuration system is definitely taking the
  right approach. Having a well-defined repository of the configuration
  scheme, containing the possible directives, when they are applicable, what
  their parameters are, etc... is the right way to go. We agreed that more
  information and stronger-typing (no RAW_ARGS!) would be good, and may
  enable on-the-fly generated configuration managers.
  
  We agreed that such a program, probably external to Apache, would generate
  a configuration and pass it to Apache, either via a standard config file,
  or by calling Apache API functions. It is desirable to be able to go the
  other way, pulling current configuration from Apache to look at, and
  perhaps change it on the fly, but unfortunately is unlikely this
  information would always be available; modules may perform optimizations
  on their configuration that makes the original configuration unavailable.
  
  For the language and specification of the configuration, we thought
  perhaps XML might be a good approach, and agreed it should be looked
  into. Other issues, such as SNMP, were brought up and laughed at.
  
  6. "So you're saying that the OS that controls half the banks, and 90% of
      the airlines, doesn't even have memory protection for seperate
      processes?"
  
  Obviously, there are a lot more items that have to be part of Apache 2.0,
  and we talked about a number of them. However, the four points above, I
  think, represent the core of the architecture we agreed on as a starting
  point.
  
  -- Alexei Kosut <ak...@stanford.edu> <http://www.stanford.edu/~akosut/>
     Stanford University, Class of 2001 * Apache <http://www.apache.org> *