
cvs commit: modperl-docs/src/docs/general/correct_headers correct_headers.pod

stas        2002/07/31 07:41:49

  Added:       src/docs/general/advocacy advocacy.pod
               src/docs/general/control control.pod
               src/docs/general/correct_headers correct_headers.pod
  Log:
  move pods into their own dirs
  
  Revision  Changes    Path
  1.1                  modperl-docs/src/docs/general/advocacy/advocacy.pod
  
  Index: advocacy.pod
  ===================================================================
  =head1 NAME
  
  mod_perl Advocacy
  
  =head1 Description
  
  Having a hard time getting mod_perl into your organization? We have
  collected some arguments you can use to convince your boss why the
  organization wants mod_perl.
  
  You can contact the L<mod_perl advocacy list|maillist::advocacy>
  if you have any more questions, or good arguments you have used (any
  success-stories are also welcome to L<the docs-dev
  list|maillist::docs-dev>).
  
  Also see L<Popular Perl Complaints and Myths|docs::general::perl_myth::perl_myth>.
  
  =head1 Thoughts about scalability and flexibility
  
  Your need for scalability and flexibility depends on what you need
  from your web site.  If you only want a simple guest book or database
  gateway with no feature headroom, you can get away with any
  EASY_AND_FAST_TO_DEVELOP_TOOL (Exchange, MS IIS, Lotus Notes, etc).
  
  Experience shows that you will soon want more functionality, at which
  point you'll discover the limitations of these "easy" tools.
  Gradually, your boss will ask for increasing functionality and at some
  point you'll realize that the tool lacks flexibility and/or
  scalability.  Then your boss will either buy another
  EASY_AND_FAST_TO_DEVELOP_TOOL and repeat the process (with
  different unforeseen problems), or you'll start investing time in
  learning how to use a powerful, flexible tool to make the long-term
  development cycle easier.
  
  If you and your company are serious about delivering flexible Internet
  functionality, do your homework.  Then urge your boss to invest a
  little extra time and resources in choosing the right tool for the
  job.  The extra quality and manageability of your site, along with
  your ability to deliver new and improved functionality quickly and
  reliably, will prove the superiority of solid, flexible tools.
  
  =head1 The boss, the developer and advocacy
  
  Each developer has a boss who participates in the decision-making
  process.  Remember that the boss considers input from sales people,
  developers, the media and associates before handing down large
  decisions.  Of course, results count!  A sales brochure makes very
  little impact compared to a working demonstration, and demonstrations
  of company-specific and developer-specific results count for a lot!
  
  Personally, when I discovered mod_perl I did a lot of testing and
  coding at home and at work. Once I had a working heavy application, I
  came to my boss with two URLs - one for the plain CGI server and the
  other for the mod_perl-enabled server. It took about 30 seconds for
  my boss to say: `Go with it'.  Of course since then I have had to
  provide all the support for other developers, which is why I took the
  time to learn it in the first place (and why this guide was
  created!).
  
  Chances are that if you've done your homework, learnt the tools and
  can deliver results, you'll have a successful project.  If you
  convince your boss to try a tool that you don't know very well, your
  results may suffer.  If your boss follows your development process
  closely and sees that your progress is much worse than expected, you
  might be told to "forget it" and mod_perl might not get a second
  chance.
  
  Advocacy is a great thing for the open-source software movement, but
  it's best done quietly until you have confidence that you can show
  productivity.  If you can demonstrate to your boss a heavy CGI which
  is running much faster under mod_perl, that may be a strong argument
  for further evaluation.  Your company may even sponsor a portion of
  your learning process.
  
  Learn the technology by working on sample projects.  Learn how to
  support yourself and learn how to get support from the community; then
  advocate your ideas to your boss.  Then you'll have the knowledge;
  your company will have the benefit; and mod_perl will have the
  reputation it deserves.
  
  =head1 A summary of perl/CGI discussion at slashdot.org
  
  Well, there was a nice discussion of the merits of Perl in the CGI
  world.  I took the time to summarize this thread, so here is what
  I've got:
  
  Perl Domination in CGI Programming?
  http://slashdot.org/askslashdot/99/10/20/1246241.shtml
  
  =over 4
  
  =item *
  
  Perl is cool and fun to code with.
  
  =item *
  
  Perl is very fast to develop with.
  
  =item *
  
  Perl is even faster to develop with if you know what CPAN is. :)
  
  =item *
  
  Math intensive code and other stuff which is faster in C/C++, can be
  plugged into Perl with XS/SWIG and may be used transparently by Perl
  programmers.
  
  =item *
  
  Most CGI applications do text processing, at which Perl excels.
  
  =item *
  
  Forking and loading of C/C++ CGI programs (unless the code is
  shared) produces an overhead.
  
  =item *
  
  Except for Intranets, bandwidth is usually a bigger bottleneck than 
  Perl performance, although this might change in the future.
  
  =item *
  
  For database driven applications, the database itself is a bottleneck.  
  Lots of posts talk about latency vs throughput.
  
  =item *
  
  mod_perl, FastCGI, Velocigen and PerlEx all give good performance
  gains over plain mod_cgi.
  
  =item *
  
  Other light alternatives to Perl and its derivatives which have
  been mentioned: PHP, Python.
  
  =item *
  
  There were almost no voices from users of M$ and similar
  technologies; I guess that's because they don't read
  http://slashdot.org :)
  
  =item *
  
  Many said that in many people's minds: 'CGI' eq 'Perl'
  
  =back
  
  =head1 Maintainers
  
  The maintainer is the person(s) you should contact with updates,
  corrections and patches.
  
  =over
  
  =item *
  
  Stas Bekman E<lt>stas (at) stason.orgE<gt>
  
  =back
  
  
  =head1 Authors
  
  =over
  
  =item *
  
  Stas Bekman E<lt>stas (at) stason.orgE<gt>
  
  =back
  
  Only the major authors are listed above. For contributors see the
  Changes file.
  
  
  =cut
  
  
  
  
  1.1                  modperl-docs/src/docs/general/control/control.pod
  
  Index: control.pod
  ===================================================================
  =head1 NAME
  
  Controlling and Monitoring the Server
  
  =head1 Description
  
  Covers techniques to restart mod_perl enabled Apache, SUID scripts,
  monitoring, and other maintenance chores, as well as some specific
  setups.
  
  =head1 Restarting Techniques
  
  All of these techniques require that you know the server process id
  (PID).  The easiest way to find the PID is to look it up in the
  I<httpd.pid> file.  To discover where that file lives, open the
  I<httpd.conf> file and locate the C<PidFile> entry.  Here is the line
  from one of my own I<httpd.conf> files:
  
    PidFile /usr/local/var/httpd_perl/run/httpd.pid
  
  As you see, with my configuration the file is
  I</usr/local/var/httpd_perl/run/httpd.pid>.
  
  Another way is to use the C<ps> and C<grep> utilities. Assuming that
  the binary is called I<httpd_perl>, we would do:
  
    % ps auxc | grep httpd_perl
  
  or maybe:
  
    % ps -ef | grep httpd_perl
  
  This will produce a list of all the C<httpd_perl> (parent and
  children) processes.  You are looking for the parent process. If you
  run your server as root, you will easily locate it since it belongs to
  root. If you run the server as some other user (when you L<don't have
  root access|guide::install/Installation_Without_Superuser_Privileges>),
  the processes will belong to that user unless defined differently in
  I<httpd.conf>.  It's still easy to find which is the parent--usually
  it's the process with the smallest PID.
  
  You will see several C<httpd> processes running on your system, but you
  should never need to send signals to any of them except the parent,
  whose pid is in the I<PidFile>.  There are three signals that you can
  send to the parent: C<SIGTERM>, C<SIGHUP>, and C<SIGUSR1>.
  
  Some folks prefer to specify signals using numerical values, rather
  than using symbols.  If you are looking for these, check out your
  C<kill(1)> man page.  My page points to
  I</usr/include/linux/signal.h>; the relevant entries are:
  
    #define SIGHUP     1    /* hangup, generated when terminal disconnects */ 
    #define SIGKILL    9    /* last resort */
    #define SIGTERM   15    /* software termination signal */
    #define SIGUSR1   10    /* user defined signal 1 */
  
  Note that to send these signals from the command line the C<SIG> prefix must
  be omitted and under some operating systems they will need to be preceded by
  a minus sign, e.g. C<kill -15> or C<kill -TERM> followed by the PID.
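
  For example, to stop the server whose I<PidFile> we located above:

    % kill -TERM `cat /usr/local/var/httpd_perl/run/httpd.pid`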
  
  =head1 Server Stopping and Restarting
  
  We will concentrate here on the implications of sending C<TERM>,
  C<HUP>, and C<USR1> signals (as arguments to kill(1)) to a mod_perl
  enabled server.  See http://www.apache.org/docs/stopping.html for
  documentation on the implications of sending these signals to a plain
  Apache server.
  
  =over 4
  
  =item TERM Signal: Stop Now
  
  Sending the C<TERM> signal to the parent causes it to immediately
  attempt to kill off all its children.  Any requests in progress are
  terminated, and no further requests are served.  This process may take
  quite a few seconds to complete.  To stop a child, the parent sends it
  a C<SIGHUP> signal.  If that fails it sends another.  If that fails it
  sends the C<SIGTERM> signal, and as a last resort it sends the
  C<SIGKILL> signal.  For each failed attempt to kill a child it makes
  an entry in the I<error_log>.
  
  When all the child processes have terminated, the parent itself exits
  and any open log files are closed.  This is when all the accumulated
  C<END> blocks are executed, apart from the ones located in scripts
  running under C<Apache::Registry> or C<Apache::PerlRun> handlers.  In
  the latter case, C<END> blocks are executed after each request is
  served.
  
  =item HUP Signal: Restart Now
  
  Sending the C<HUP> signal to the parent causes it to kill off its
  children as if the C<TERM> signal had been sent, i.e. any requests in
  progress are terminated; but the parent does not exit.  Instead, the
  parent re-reads its configuration files, spawns a new set of child
  processes and continues to serve requests.  It is almost equivalent to
  stopping and then restarting the server.
  
  If the configuration files contain errors when restart is signaled,
  the parent will exit, so it is important to check the configuration
  files for errors before issuing a restart.  How to perform the check
  will be covered shortly.
  
  Sometimes using this approach to restart a mod_perl enabled Apache
  may cause the processes' memory usage to grow incrementally after
  each restart. This happens when Perl code loaded in memory is not
  completely torn down, leading to a memory leak.
  
  =item USR1 Signal: Gracefully Restart Now
  
  The C<USR1> signal causes the parent process to advise the children to
  exit after serving their current requests, or to exit immediately if
  they're not serving a request.  The parent re-reads its configuration
  files and re-opens its log files.  As each child dies off the parent
  replaces it with a child from the new generation (the new children use
  the new configuration) and it begins serving new requests immediately.
  
  The only difference between C<USR1> and C<HUP> is that C<USR1> allows
  the children to complete any current requests before they are killed
  off, so there is no interruption in service.  With C<HUP>, by
  contrast, it may take a few seconds for the restart to complete, and
  during that time no requests are served.
  
  =back
  
  By default, if a server is restarted (using C<kill -USR1 `cat
  logs/httpd.pid`> or with the C<HUP> signal), Perl scripts and modules
  are not reloaded.  To reload C<PerlRequire>s, C<PerlModule>s, other
  C<use()>'d modules and flush the C<Apache::Registry> cache, use this
  directive in I<httpd.conf>:
  
    PerlFreshRestart On
  
  Make sure you read L<Evil things might happen when using
  PerlFreshRestart|guide::troubleshooting/Evil_things_might_happen_when_using_PerlFreshRestart>.
  
  =head1 Speeding up the Apache Termination and Restart
  
  We've already mentioned that restart or termination can sometimes
  take quite a long time (e.g. tens of seconds) for a mod_perl server.
  The reason for that is a call to the C<perl_destruct()> Perl API
  function during the child exit phase.  This will cause proper
  execution of C<END> blocks found during server startup and will
  invoke the C<DESTROY> method on global objects which are still alive.
  
  It is also possible that this operation may take a long time to
  finish, causing a long delay during a restart.  Sometimes this will be
  followed by a series of messages appearing in the server I<error_log>
  file, warning that certain child processes did not exit as expected.
  This happens when, after a few attempts to advise the child process
  to quit, the child is still in the middle of C<perl_destruct()>.
  Then a lethal C<KILL> signal is sent, aborting whatever operation the
  child happened to be executing and I<brutally> killing it.
  
  If your code does not contain any C<END> blocks or C<DESTROY> methods
  which need to be run during child server shutdown, or if it has these
  but it is unimportant whether they get executed, this destruction can
  be avoided by setting the C<PERL_DESTRUCT_LEVEL> environment variable
  to C<-1>. For example, add this setting to the I<httpd.conf> file:
  
   PerlSetEnv PERL_DESTRUCT_LEVEL -1
  
  What constitutes a significant cleanup?  Any change of state outside
  of the current process that would not be handled by the operating
  system itself.  So committing database transactions and removing the
  lock on some resource are significant operations, but closing an
  ordinary file isn't.
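
  For example, a C<DESTROY> method like the following (a hypothetical
  sketch; C<My::LockGuard> is not a real module) performs a significant
  cleanup:

    package My::LockGuard;  # hypothetical guard object for a lock file
    sub new {
        my ($class, $lockfile) = @_;
        return bless { lockfile => $lockfile }, $class;
    }
    sub DESTROY {
        my $self = shift;
        # a significant cleanup: the OS won't remove the
        # lock file for us when the process exits
        unlink $self->{lockfile}
            or warn "Can't remove $self->{lockfile}: $!\n";
    }
    1;

  With C<PERL_DESTRUCT_LEVEL> set to C<-1> this C<DESTROY> method would
  be skipped at child exit, and the lock file would be left behind.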
  
  =head1 Using apachectl to Control the Server
  
  The Apache distribution comes with a script to control the server.
  It's called C<apachectl> and it is installed into the same location
  as the httpd executable.  We will assume for the sake of our examples
  that it's in C</usr/local/sbin/httpd_perl/apachectl>.
  
  To start httpd_perl:
  
    % /usr/local/sbin/httpd_perl/apachectl start 
  
  To stop httpd_perl:
  
    % /usr/local/sbin/httpd_perl/apachectl stop
  
  To restart httpd_perl (if it is running, send C<SIGHUP>; if it is not
  already running just start it):
  
    % /usr/local/sbin/httpd_perl/apachectl restart
  
  Do a graceful restart by sending a C<SIGUSR1>, or start if not
  running:
  
    % /usr/local/sbin/httpd_perl/apachectl graceful
  
  To do a configuration test:
  
    % /usr/local/sbin/httpd_perl/apachectl configtest 
  
  Replace C<httpd_perl> with C<httpd_docs> in the above calls to control
  the C<httpd_docs> server.
  
  There are other options for C<apachectl>; use the C<help> option to
  see them all.
  
  It's important to remember that C<apachectl> uses the PID file, which
  is specified by the C<PidFile> directive in I<httpd.conf>.  If you
  delete the PID file by hand while the server is running, C<apachectl>
  will be unable to stop or restart the server.
  
  =head1 Safe Code Updates on a Live Production Server
  
  You have prepared a new version of code, uploaded it into a production
  server, restarted it and it doesn't work.  What could be worse than
  that?  You also cannot go back, because you have overwritten the good
  working code.
  
  It's quite easy to prevent this: just don't overwrite the previous
  working files!
  
  Personally I do all updates on the live server with the following
  sequence.  Assume that the server root directory is
  I</home/httpd/perl/rel>.  When I'm about to update the files I create
  a new directory I</home/httpd/perl/beta>, copy the old files from
  I</home/httpd/perl/rel> and update it with the new files.  Then I do
  some last sanity checks (check that file permissions are
  [read+executable], and run C<perl -c> on the new modules to make sure
  there are no errors in them).  When I think I'm ready I do:
  
    % cd /home/httpd/perl
    % mv rel old && mv beta rel && stop && sleep 3 && restart && err
  
  Let me explain what this does.
  
  Firstly, note that I put all the commands on one line, separated by
  C<&&>, and only then press the C<Enter> key.  As I am working
  remotely, this ensures that if I suddenly lose my connection (sadly
  this happens sometimes) I won't leave the server down if only the
  C<stop> command got through.  C<&&> also ensures that if any command
  fails, the rest won't be executed.  I am using aliases (which I have
  already defined) to make the typing easier:
  
    % alias | grep apachectl
    graceful /usr/local/apache/bin/apachectl graceful
    rehup   /usr/local/apache/sbin/apachectl restart
    restart /usr/local/apache/bin/apachectl restart
    start   /usr/local/apache/bin/apachectl start
    stop    /usr/local/apache/bin/apachectl stop
  
    % alias err
    tail -f /usr/local/apache/logs/error_log
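
  (If your shell is bash rather than csh, equivalent aliases, using the
  same paths as above, could be defined like this:)

    alias start='/usr/local/apache/bin/apachectl start'
    alias stop='/usr/local/apache/bin/apachectl stop'
    alias restart='/usr/local/apache/bin/apachectl restart'
    alias err='tail -f /usr/local/apache/logs/error_log'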
  
  Taking the line apart piece by piece:
  
    mv rel old &&
  
  back up the working directory to I<old>
  
    mv beta rel &&
  
  put the new one in its place
  
    stop &&
  
  stop the server
  
    sleep 3 &&
  
  give it a few seconds to shut down (it might take even longer)
  
    restart &&
  
  C<restart> the server
  
    err
  
  view the tail of the I<error_log> file in order to check that
  everything is OK
  
  C<apachectl> generates the status messages a little too early
  (e.g. when you issue C<apachectl stop> it says the server has been
  stopped, while in fact it's still running), so don't rely on these
  messages; rely on the C<error_log> file instead.
  
  Also notice that I use C<restart> and not just C<start>.  I do this
  because of Apache's potentially long stopping times (it depends on
  what you do with it of course!).  If you use C<start> and Apache
  hasn't yet released the port it's listening on, the start will fail
  and C<error_log> will tell you that the port is in use, e.g.:
  
    Address already in use: make_sock: could not bind to port 8080
  
  But if you use C<restart>, it will wait for the server to quit and
  then will cleanly restart it.
  
  Now what happens if the new modules are broken?  First of all, I
  immediately see an indication of the problems reported in the
  C<error_log> file, which I C<tail -f> immediately after the restart
  command.  If there's a problem, I just put everything back as it was
  before:
  
    % mv rel bad && mv old rel && stop && sleep 3 && restart && err
  
  Usually everything will be fine, and I have had only about 10 seconds
  of downtime, which is pretty good!
  
  =head1 An Intentional Disabling of Live Scripts
  
  What happens if you really must take down the server or disable the
  scripts?  This situation might happen when you need to do some
  maintenance work on your database server.  If you have to take your
  database down then any scripts that use it will fail.
  
  If you do nothing, the user will see either the grey C<An Error has
  happened> message or perhaps a customized error message if you have
  added code to trap and customize the errors.  See L<Redirecting Errors
  to the Client instead of to the
  error_log|guide::snippets/Redirecting_Errors_to_the_Client_Instead_of_error_log>
  for the latter case.
  
  A much friendlier approach is to confess to your users that you are
  doing some maintenance work and plead for patience, promising (keep
  the promise!) that the service will become fully functional in X
  minutes.  There are a few ways to do this:
  
  The first doesn't require messing with the server.  It works when
  you have to disable a script running under C<Apache::Registry> and
  relies on the fact that C<Apache::Registry> checks whether the file
  was modified before using the cached version.  Obviously it won't
  work under other handlers, because these serve the compiled version
  of the code and don't check to see if there was a change in the code
  on the disk.
  
  So if you want to disable an C<Apache::Registry> script, prepare a
  little script like this:
  
    /home/http/perl/maintenance.pl
    ------------------------------
    #!/usr/bin/perl -Tw
    
    use strict;
    use CGI;
    my $q = new CGI;
    print $q->header, $q->p(
    "Sorry, the service is temporarily down for maintenance. 
     It will be back in ten to fifteen minutes.
     Please, bear with us.
     Thank you!");
  
  So if you now have to disable a script, for example
  C</home/http/perl/chat.pl>, just do this:
  
    % mv /home/http/perl/chat.pl /home/http/perl/chat.pl.orig
    % ln -s /home/http/perl/maintenance.pl /home/http/perl/chat.pl
  
  Of course your server configuration should allow symbolic links for
  this trick to work.  Make sure you have the directive
  
    Options FollowSymLinks
  
  in the C<E<lt>LocationE<gt>> or C<E<lt>DirectoryE<gt>> section of your
  I<httpd.conf>.
  
  When you're done, it's easy to restore the previous setup.  Just do
  this:
  
    % mv /home/http/perl/chat.pl.orig /home/http/perl/chat.pl
  
  which overwrites the symbolic link.
  
  Now make sure that the script will have the current timestamp:
  
    % touch /home/http/perl/chat.pl
  
  Apache will automatically detect the change and will use the
  restored script from now on.
  
  The second approach is to change the server configuration and
  configure a whole directory to be handled by a C<My::Maintenance>
  handler (which you must write).  For example if you write something
  like this:
  
    My/Maintenance.pm
    ------------------
    package My::Maintenance;
    use strict;
    use Apache::Constants qw(:common);
    sub handler {
      my $r = shift;
      $r->send_http_header("text/plain");
      print qq{
        We apologize, but this service is temporarily stopped for
        maintenance.  It will be back in ten to fifteen minutes.  
        Please, bear with us.  Thank you!
      };
      return OK;
    }
    1;
  
  and put it in a directory that is in the server's C<@INC>.  To
  disable all the scripts in Location C</perl>, you would replace:
  
    <Location /perl>
      SetHandler perl-script
      PerlHandler My::Handler
      [snip]
    </Location>
  
  with
  
    <Location /perl>
      SetHandler perl-script
      PerlHandler My::Maintenance
      [snip]
    </Location>
  
  Now restart the server.  Your users will be happy to go and read
  http://slashdot.org for ten minutes, knowing that you are working on a
  much better version of the service.
  
  If you need to disable a location handled by some module, the second
  approach would work just as well.
  
  =head1 SUID Start-up Scripts
  
  If you want to allow a few people in your team to start and stop the
  server you will have to give them the root password, which is not a
  good thing to do. The fewer people who know the password, the fewer
  problems are likely to be encountered.  But there is an easy solution
  for this problem available on UNIX platforms.  It's called a setuid
  executable.
  
  =head2 Introduction to SUID Executables
  
  A setuid executable has the setuid permissions bit set. Upon
  execution, this sets the process's effective user ID to that of the
  file's owner. You apply this setting with the following command:
  
    % chmod u+s filename
  
  You have probably used setuid executables before without even
  knowing about it. For example, when you change your password you
  execute the C<passwd> utility, which among other things modifies the
  I</etc/passwd> file. In order to change this file you need root
  permissions, so the C<passwd> utility has the setuid bit set.
  Therefore when you execute this utility, its effective UID is that
  of the root user.
  
  You should avoid using setuid executables as a general practice. The
  fewer setuid executables you have, the less likely it is that someone
  will find a way to break into your system by exploiting some bug you
  didn't know about.
  
  When the executable is setuid to root, you have to make sure that it
  doesn't have the group and world read and write permissions. If we
  take a look at the C<passwd> utility we will see:
  
    % ls -l /usr/bin/passwd
    -r-s--x--x 1 root root 12244 Feb 8 00:20 /usr/bin/passwd
  
  You achieve this with the following command:
  
    % chmod 4511 filename
  
  The first digit (4) stands for the setuid bit, the second digit (5)
  is a combination of read (4) and execute (1) permissions for the
  user, and the third and fourth digits set execute-only permissions
  for the group and the world.
  
  =head2 Apache Startup SUID Script's Security
  
  In our case, we want to allow setuid access only to a specific group
  of users, who all belong to the same group. For the sake of our
  example we will use the group named I<apache>. It's important that
  users who are neither root nor members of the I<apache> group will
  not be able to execute this script. Therefore we perform the
  following commands:
  
    % chgrp apache apachectl
    % chmod  4510  apachectl
  
  The execution order is important. If you swap the command execution
  order you will lose the setuid bit.
  
  Now if we look at the file we see:
  
    % ls -l apachectl
    -r-s--x--- 1 root apache 32 May 13 21:52 apachectl
  
  Now we are all set... Almost...
  
  When you start Apache, Apache and Perl modules are loaded and code
  can be executed. Since all this happens with the effective UID of
  root, any code is executed as if the root user were running it. You
  should be very careful: while you didn't give anyone the root
  password, all the users in the I<apache> group have indirect root
  access. This means that if Apache loads some module or executes some
  code that is writable by any of these users, they can plant code that
  will give them shell access to the root account and make them real
  root.
  
  Of course if you don't trust your team you shouldn't use this
  solution in the first place. You can try to check that all the files
  Apache loads aren't writable by anyone but root, but there are too
  many of them, especially in the mod_perl case, where many Perl
  modules are loaded at server startup.
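
  If you want to attempt such a check anyway, here is a rough sketch
  that could be run from the server startup file once all the modules
  are loaded; it walks Perl's C<%INC> and warns about files that are
  not owned by root or are group- or world-writable:

    # check the ownership and permissions of all loaded Perl modules
    for my $file (sort values %INC) {
        my ($mode, $uid) = (stat $file)[2, 4];
        next unless defined $mode;
        warn "$file is not owned by root\n" if $uid != 0;
        warn "$file is group- or world-writable\n" if $mode & 022;
    }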
  
  By the way, don't let all this setuid stuff confuse you -- when the
  parent process is loaded, the child processes are spawned as non-root
  processes. This section has presented a way to allow non-root users
  to start the server as the root user; the rest is exactly the same as
  if you were executing the script as root in the first place.
  
  =head2 Sample Apache Startup SUID Script
  
  Now if you are still with us, here is an example of the setuid Apache
  startup script.
  
  Note the line marked C<WORKAROUND>, which fixes an obscure error when
  starting mod_perl enabled Apache by setting the real UID to the
  effective UID.  Without this workaround, a mismatch between the real
  and the effective UID causes Perl to croak on the C<-e> switch.
  
  Note that you must be using a version of Perl that recognizes and
  emulates the suid bits in order for this to work.  This script will do
  different things depending on whether it is named C<start_httpd>,
  C<stop_httpd> or C<restart_httpd>.  You can use symbolic links for
  this purpose.
  
    suid_apache_ctl
    ---------------
    #!/usr/bin/perl -T
     
    # These constants will need to be adjusted.
    $PID_FILE = '/home/www/logs/httpd.pid';
    $HTTPD = '/home/www/httpd -d /home/www';
    
    # These prevent taint warnings while running suid
    $ENV{PATH}='/bin:/usr/bin';
    $ENV{IFS}='';
    
    # This sets the real to the effective ID, and prevents
    # an obscure error when starting apache/mod_perl
    $< = $>; # WORKAROUND
    $( = $) = 0; # set the group to root too
    
    # Do different things depending on our name
    ($name) = $0 =~ m|([^/]+)$|;
    
    if ($name eq 'start_httpd') {
        system $HTTPD and die "Unable to start HTTP";
        print "HTTP started.\n";
        exit 0;
    }
    
    # extract the process id and confirm that it is numeric
    $pid = `cat $PID_FILE`;
    $pid =~ /(\d+)/ or die "PID $pid not numeric";
    $pid = $1;
    
    if ($name eq 'stop_httpd') {
        kill 'TERM',$pid or die "Unable to signal HTTP";
        print "HTTP stopped.\n";
        exit 0;
    }
    
    if ($name eq 'restart_httpd') {
        kill 'HUP',$pid or die "Unable to signal HTTP";
        print "HTTP restarted.\n";
        exit 0;
    }
    
    die "Script must be named start_httpd, stop_httpd, or restart_httpd.\n";
  
  =head1 Preparing for Machine Reboot
  
  When you run your own development box, it's okay to start the
  webserver by hand when you need to.  On a production system it is
  possible that the machine the server is running on will have to be
  rebooted.  When the reboot is completed, who is going to remember to
  start the server?  It's easy to forget this task, and what happens if
  you aren't around when the machine is rebooted?
  
  After the server installation is complete, it's important not to
  forget to put a script to perform the server startup and shutdown
  into the standard system location, for example I</etc/rc.d> under
  RedHat Linux, or I</etc/init.d/apache> under Debian Slink Linux.
  
  This is the directory which contains scripts to start and stop all the
  other daemons.  The directory and file names vary from one Operating
  System (OS) to another, and even between different distributions of
  the same OS.
  
  Generally the simplest solution is to copy the C<apachectl> script to
  your startup directory or create a symbolic link from the startup
  directory to the C<apachectl> script.  You will find C<apachectl> in
  the same directory as the httpd executable after Apache installation.
  If you have more than one Apache server you will need a separate
  script for each one, and of course you will have to rename them so
  that they can co-exist in the same directories.
  
  For example on a RedHat Linux machine with two servers, I have the
  following setup:
  
    /etc/rc.d/init.d/httpd_docs
    /etc/rc.d/init.d/httpd_perl
    /etc/rc.d/rc3.d/S91httpd_docs -> ../init.d/httpd_docs
    /etc/rc.d/rc3.d/S91httpd_perl -> ../init.d/httpd_perl
    /etc/rc.d/rc6.d/K16httpd_docs -> ../init.d/httpd_docs
    /etc/rc.d/rc6.d/K16httpd_perl -> ../init.d/httpd_perl
  
  The scripts themselves reside in the I</etc/rc.d/init.d> directory.
  There are symbolic links to these scripts in other directories. The
  names are the same as the script names but they have numerical
  prefixes, which are used for executing the scripts in a particular
  order: the lower numbers are executed earlier.
  
  When the system starts (level 3) we want Apache to be started after
  almost all of the other services are already running; therefore I've
  used I<S91>. For example, if the mod_perl enabled Apache issues a
  C<connect_on_init()>, the SQL server should be started before Apache.

  When the system shuts down (level 6), Apache should be stopped as one
  of the first processes; therefore I've used I<K16>. Again, if the
  server does some cleanup processing during the shutdown event and
  requires third party services (e.g. an SQL server) to be running, it
  should be stopped before those services.
  
  Notice that it's normal for more than one symbolic link to have the
  same sequence number.
  
  Under RedHat Linux and similar systems, when a machine is booted and
  its runlevel set to 3 (multiuser + network), Linux goes into
  I</etc/rc.d/rc3.d/> and executes the scripts the symbolic links point
  to with the C<start> argument.  When it sees I<S91httpd_perl>, it
  executes:
  
    /etc/rc.d/init.d/httpd_perl start
  
  When the machine is shut down, the scripts are executed through links
  from the I</etc/rc.d/rc6.d/> directory.  This time the scripts are
  called with the C<stop> argument, like this:
  
    /etc/rc.d/init.d/httpd_perl stop
  
  Most systems have GUI utilities to automate the creation of symbolic
  links.  For example RedHat Linux includes the C<control-panel>
  utility, which amongst other things includes the C<RunLevel Manager>
  (the latter can be invoked directly as either ntsysv(8) or
  tksysv(8)).  This will help you to create the proper symbolic links.
  Of course before you use it, you should put C<apachectl> or a
  similar script into the I<init.d> or equivalent directory.  Or you
  can have a symbolic link to some other location instead.
  
  The simplest approach is to use the chkconfig(8) utility which adds
  and removes the services for you. The following example shows how to
  add an I<httpd_perl> startup script to the system.
  
  First move or copy the file into the directory I</etc/rc.d/init.d>:
  
    % mv httpd_perl /etc/rc.d/init.d
  
  Now open the script in your favorite editor and add the following
  lines after the main header of the script:
  
    # Comments to support chkconfig on RedHat Linux
    # chkconfig: 2345 91 16
    # description: mod_perl enabled Apache Server
  
  So now the beginning of the script looks like:
  
    #!/bin/sh
    #
    # Apache control script designed to allow an easy command line
    # interface to controlling Apache.  Written by Marc Slemko,
    # 1997/08/23
    
    # Comments to support chkconfig on RedHat Linux
    # chkconfig: 2345 91 16
    # description: mod_perl enabled Apache Server
    
    #
    # The exit codes returned are:
    # ...
  
  Adjust the line:
  
    # chkconfig: 2345 91 16
  
  to your needs. The above setting says that the script should be
  started in levels 2, 3, 4, and 5, that its start priority should be
  91, and that its stop priority should be 16.
  
  Now all you have to do is to ask C<chkconfig> to configure the startup
  scripts. Before we do that let's look at what we have:
  
    % find /etc/rc.d | grep httpd_perl
    
    /etc/rc.d/init.d/httpd_perl
  
  This means that we only have the startup script itself. Now we
  execute:
  
    % chkconfig --add httpd_perl
  
  and see what has changed:
  
    % find /etc/rc.d | grep httpd_perl
    
    /etc/rc.d/init.d/httpd_perl
    /etc/rc.d/rc0.d/K16httpd_perl
    /etc/rc.d/rc1.d/K16httpd_perl
    /etc/rc.d/rc2.d/S91httpd_perl
    /etc/rc.d/rc3.d/S91httpd_perl
    /etc/rc.d/rc4.d/S91httpd_perl
    /etc/rc.d/rc5.d/S91httpd_perl
    /etc/rc.d/rc6.d/K16httpd_perl
  
  As you can see C<chkconfig> created all the symbolic links for us,
  using the startup and shutdown priorities as specified in the line:
  
    # chkconfig: 2345 91 16
  
  If for some reason you want to remove the service from the startup
  scripts, all you have to do is to tell C<chkconfig> to remove the
  links:
  
    % chkconfig --del httpd_perl
  
  Now if we look at the files under the directory I</etc/rc.d/> we see
  again only the script itself.
  
    % find /etc/rc.d | grep httpd_perl
    
    /etc/rc.d/init.d/httpd_perl
  
  Of course you may keep the startup script in any other directory as
  long as you can link to it. For example if you want to keep this file
  with all the Apache binaries in I</usr/local/apache/bin>, all you have
  to do is to provide a symbolic link to this file:
  
    % ln -s /usr/local/apache/bin/apachectl /etc/rc.d/init.d/httpd_perl
  
  and then:
  
    %  chkconfig --add httpd_perl
  
  Note that when using symlinks, it is the link name in
  I</etc/rc.d/init.d> that matters, not the name of the script the
  link points to.
  
  =head1 Monitoring the Server.  A watchdog.
  
  With mod_perl many things can happen to your server.  It is possible
  that the server might die when you are not around.  As with any other
  critical service you need to run some kind of watchdog.
  
  One simple solution is to use a slightly modified C<apachectl> script,
  which I've named I<apache.watchdog>.  Call it from the crontab every
  30 minutes -- or even every minute -- to make sure the server is up
  all the time.
  
  The crontab entry for 30 minute intervals:
  
    0,30 * * * * /path/to/the/apache.watchdog >/dev/null 2>&1
  
  The script:
  
    #!/bin/sh
      
    # this script is a watchdog checking whether the server is online;
    # if the server is down it tries to start it, and sends
    # an email alert to the admin
    
    # admin's email
    EMAIL=webmaster@example.com
      
    # the path to your PID file
    PIDFILE=/usr/local/var/httpd_perl/run/httpd.pid
      
    # the path to your httpd binary, including options if necessary
    HTTPD=/usr/local/sbin/httpd_perl/httpd_perl
          
    # check for pidfile
    if [ -f $PIDFILE ] ; then
      PID=`cat $PIDFILE`
      
      if kill -0 $PID; then
        STATUS="httpd (pid $PID) running"
        RUNNING=1
      else
        STATUS="httpd (pid $PID?) not running"
        RUNNING=0
      fi
    else
      STATUS="httpd (no pid file) not running"
      RUNNING=0
    fi
        
    if [ $RUNNING -eq 0 ]; then
      echo "$0: httpd not running, trying to start"
      if $HTTPD ; then
        echo "$0: httpd started"
        echo "$0: httpd started" | \
          mail -s "$0: httpd started" $EMAIL > /dev/null 2>&1
      else
        echo "$0: httpd could not be started"
        echo "$0: httpd could not be started" | \
          mail -s "$0: httpd could not be started" $EMAIL > /dev/null 2>&1
      fi
    fi
  
  Another approach, probably even more practical, is to use the cool
  C<LWP> Perl package to test the server by trying to fetch some
  document (script) served by the server.  Why is it more practical?
  Because even when the server is up as a process, it can be stuck and
  not working.  Failing to fetch the document will trigger a restart,
  and "probably" the problem will go away.
  
  As before, we set up a cron job to call this script every few
  minutes to fetch some very light script.  The best thing of course is
  to call it every minute.  Why so often?  If your server starts to
  spin and trash your disk space with multiple error messages filling
  the I<error_log>, in five minutes you might run out of free disk
  space, which might bring your system to its knees.  Chances are that
  no other child will be able to serve requests, since the system will
  be too busy writing to the I<error_log> file.  Think big--if you are
  running a heavy service (which is very fast since you are running
  under mod_perl) adding one more request every minute will not be felt
  by the server at all.
  
  So we end up with a crontab entry like this:
  
    * * * * * /path/to/the/watchdog.pl >/dev/null 2>&1
  
  And the watchdog itself:
  
    #!/usr/bin/perl -wT
    
    # untaint
    $ENV{'PATH'} = '/bin:/usr/bin';
    delete @ENV{'IFS', 'CDPATH', 'ENV', 'BASH_ENV'};
    
    use strict;
    use diagnostics;
    
    my $VERSION = '0.01';
    use vars qw($ua $proxy);
    $proxy = '';    
    
    require LWP::UserAgent;
    use HTTP::Status;
    
    ###### Config ########
    my $test_script_url = 'http://www.example.com:81/perl/test.pl';
    my $monitor_email   = 'root@localhost';
    my $restart_command = '/usr/local/sbin/httpd_perl/apachectl restart';
    my $mail_program    = '/usr/lib/sendmail -t -n';
    ######################
    
    $ua  = new LWP::UserAgent;
    $ua->agent("$0/watchdog " . $ua->agent);
    # Uncomment the proxy if you access a machine from behind a firewall
    # $proxy = "http://www-proxy.com";
    $ua->proxy('http', $proxy) if $proxy;
    
    # If it returns '1' it means we are alive
    exit 1 if checkurl($test_script_url);
    
    # Houston, we have a problem.
    # The server seems to be down, try to restart it. 
    my $status = system $restart_command;
    
    my $message = ($status == 0) 
                ? "Server was down and successfully restarted!" 
                : "Server is down. Can't restart.";
      
    my $subject = ($status == 0) 
                ? "Attention! Webserver restarted"
                : "Attention! Webserver is down. can't restart";
    
    # email the monitoring person
    my $to = $monitor_email;
    my $from = $monitor_email;
    send_mail($from,$to,$subject,$message);
    
    # input:  URL to check 
    # output: 1 for success, 0 for failure
    #######################  
    sub checkurl{
      my ($url) = @_;
    
      # Fetch document 
      my $res = $ua->request(HTTP::Request->new(GET => $url));
    
      # Check the result status
      return 1 if is_success($res->code);
    
      # failed
      return 0;
    } #  end of sub checkurl
    
    # send email about the problem 
    #######################  
    sub send_mail{
      my($from,$to,$subject,$messagebody) = @_;
    
      open MAIL, "|$mail_program"
          or die "Can't open a pipe to a $mail_program :$!\n";
     
      print MAIL <<__END_OF_MAIL__;
    To: $to
    From: $from
    Subject: $subject
    
    $messagebody
    
    __END_OF_MAIL__
    
      close MAIL;
    } 
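
  The I<test.pl> script that the watchdog fetches should be as light
  as possible.  A minimal sketch (assuming it runs under
  C<Apache::Registry>) could be:

    #!/usr/bin/perl -Tw
    # a very light script for the watchdog to fetch
    use strict;
    use CGI;
    my $q = CGI->new;
    print $q->header('text/plain'), "alive\n";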
  
  =head1 Running a Server in Single Process Mode
  
  Often while developing new code, you will want to run the server in
  single process mode.  See L<Sometimes it works Sometimes it does
  Not|guide::porting/Sometimes_it_Works__Sometimes_it_Doesn_t> and 
  L<Names collisions with Modules and
  libs|guide::porting/Name_collisions_with_Modules_and_libs>.  Running in
  single process mode inhibits the server from "daemonizing", and this
  allows you to run it under the control of a debugger more easily.
  
    % /usr/local/sbin/httpd_perl/httpd_perl -X
  
  When you use the C<-X> switch the server will run in the foreground of
  the shell, so you can kill it with I<Ctrl-C>.
  
  Note that in C<-X> (single-process) mode the server will run very
  slowly when fetching images.
  
  Note for Netscape users:
  
  If you use Netscape while your server is running in single-process
  mode, HTTP's C<KeepAlive> feature gets in the way.  Netscape tries to
  open multiple connections and keep them open.  Because there is only
  one server process listening, each connection has to time out before
  the next succeeds.  Turn off C<KeepAlive> in I<httpd.conf> to avoid
  this effect while developing.  If you use the image size parameters,
  Netscape will be able to render the page without the images so you can
  press the browser's I<STOP> button after a few seconds.
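
  Turning it off is a single directive in I<httpd.conf>:

    KeepAlive Off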
  
  In addition you should know that when running with C<-X> you will not
  see the control messages that the parent server normally writes to the
  I<error_log> (I<"server started">, I<"server stopped"> etc).  Since
  C<httpd -X> causes the server to handle all requests itself, without
  forking any children, there is no controlling parent to write the
  status messages.
  
  =head1 Starting a Personal Server for Each Developer
  
  If you are the only developer working on the specific server:port you
  have no problems, since you have complete control over the server.
  However, often you will have a group of developers who need to develop
  mod_perl scripts and modules concurrently.  This means that each
  developer will want to have control over the server - to kill it, to
  run it in single server mode, to restart it, etc., as well as having
  control over the location of the log files, configuration settings
  like C<MaxClients>, and so on.
  
  You I<can> work around this problem by preparing a few I<httpd.conf>
  files and forcing each developer to use
  
    httpd_perl -f /path/to/httpd.conf  
  
  but I approach it in a different way.  I use the C<-Dparameter>
  startup option of the server.  I start my version of the server with:

    % httpd_perl -Dstas
  
  In I<httpd.conf> I write:
  
    # Personal development Server for stas
    # stas uses the server running on port 8000
    <IfDefine stas>
    Port 8000
    PidFile /usr/local/var/httpd_perl/run/httpd.pid.stas
    ErrorLog /usr/local/var/httpd_perl/logs/error_log.stas
    Timeout 300
    KeepAlive On
    MinSpareServers 2
    MaxSpareServers 2
    StartServers 1
    MaxClients 3
    MaxRequestsPerChild 15
    </IfDefine>
    
    # Personal development Server for eric
    # eric uses the server running on port 8001
    <IfDefine eric>
    Port 8001
    PidFile /usr/local/var/httpd_perl/run/httpd.pid.eric
    ErrorLog /usr/local/var/httpd_perl/logs/error_log.eric
    Timeout 300
    KeepAlive Off
    MinSpareServers 1
    MaxSpareServers 2
    StartServers 1
    MaxClients 5
    MaxRequestsPerChild 0
    </IfDefine>
  
  With this technique we have achieved full control over start/stop,
  number of children, a separate error log file, and port selection for
  each server.  This saves Stas from getting called every few minutes by
  Eric: "Stas, I'm going to restart the server".
  
  In the above technique, you need to discover the PID of your parent
  C<httpd_perl> process, which is written to
  C</usr/local/var/httpd_perl/run/httpd.pid.stas> (and similarly for
  the user I<eric>).  To make things even easier we change the
  I<apachectl> script to do the work for us.  We make a copy for each
  developer, called B<apachectl.username>, and we change two lines in
  each script:
  
    PIDFILE=/usr/local/var/httpd_perl/run/httpd.pid.username
    HTTPD='/usr/local/sbin/httpd_perl/httpd_perl -Dusername'
  
  So for the user I<stas> we prepare a startup script called
  I<apachectl.stas>, in which we change these two lines in the standard
  apachectl script as it comes unmodified from the Apache
  distribution:
  
    PIDFILE=/usr/local/var/httpd_perl/run/httpd.pid.stas
    HTTPD='/usr/local/sbin/httpd_perl/httpd_perl -Dstas'
  
  So now when user I<stas> wants to stop the server he will execute:
  
    apachectl.stas stop
  
  And to start:
  
    apachectl.stas start
  
  Certainly the rest of the C<apachectl> arguments apply as before.
  
  You might think about having only one C<apachectl> and detecting who
  is calling it by checking the UID, but since you have to be root to
  start the server this is not possible, unless you set the setuid bit
  on this script, as we've explained in the beginning of this
  chapter. If you do so, you can have a single C<apachectl> script for
  all developers, once you modify it to automatically find out the UID
  of the user who executes it and to set the right paths accordingly.
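
  A hypothetical sketch of that modification (the lookup must happen
  before the script sets the real UID to the effective UID):

    # find out who invoked the setuid script ($< is still
    # the real UID of the calling user at this point)
    my $username = getpwuid($<)
        or die "Can't resolve a username for UID $<\n";
    my $PIDFILE  = "/usr/local/var/httpd_perl/run/httpd.pid.$username";
    my $HTTPD    = "/usr/local/sbin/httpd_perl/httpd_perl -D$username";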
  
  The last thing is to provide developers with the option to run the
  server in single process mode:
  
    /usr/local/sbin/httpd_perl/httpd_perl -Dstas -X
  
  In addition to making life easier, we decided to use relative links
  everywhere in the static documents, including the calls to CGIs.  You
  may ask how using relative links will get the request to the right
  server port.  It's very simple: we use C<mod_rewrite>.
  
  To use mod_rewrite you have to configure your I<httpd_docs> server
  with C<--enable-module=rewrite> and recompile, or use DSO and load the
  module in I<httpd.conf>.  In the I<httpd.conf> of our C<httpd_docs>
  server we have the following code:
  
    RewriteEngine on
    
    # stas's server
    # port = 8000
    RewriteCond  %{REQUEST_URI} ^/(perl|cgi-perl)
    RewriteCond  %{REMOTE_ADDR} 123.34.45.56
    RewriteRule ^(.*)           http://example.com:8000/$1 [P,L]
    
    # eric's server
    # port = 8001
    RewriteCond  %{REQUEST_URI} ^/(perl|cgi-perl)
    RewriteCond  %{REMOTE_ADDR} 123.34.45.57
    RewriteRule ^(.*)           http://example.com:8001/$1 [P,L]
    
    # all the rest
    RewriteCond  %{REQUEST_URI} ^/(perl|cgi-perl)
    RewriteRule ^(.*)           http://example.com:81/$1 [P]
  
  The IP addresses are the addresses of the developer desktop machines
  (where they are running their web browsers).  So if an HTML file
  includes a relative URI like I</perl/test.pl> or even
  I<http://www.example.com/perl/test.pl>, clicking on the link will be
  internally proxied to I<http://www.example.com:8000/perl/test.pl> if
  the click has been made at the user I<stas>'s desktop machine, or to
  I<http://www.example.com:8001/perl/test.pl> for a request generated
  from the user I<eric>'s machine, per the above rewrite rules.
  
  Another possibility is to use the C<REMOTE_USER> variable, if all
  the developers are forced to authenticate themselves before they can
  access the server. If you do, you will have to change the
  C<RewriteCond>s to match C<REMOTE_USER> in the above example.
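
  For example, the rules for I<stas>'s server might then look like
  this sketch (note the C<LA-U> look-ahead: in the per-server context
  the rewrite phase runs before authentication, so C<%{REMOTE_USER}>
  alone would still be empty):

    RewriteCond  %{REQUEST_URI}       ^/(perl|cgi-perl)
    RewriteCond  %{LA-U:REMOTE_USER}  ^stas$
    RewriteRule  ^(.*)                http://example.com:8000/$1 [P,L]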
  
  We wish to stress again that the above setup will work only with
  relative URIs in the HTML code. If you choose to generate full URIs
  including a non-80 port, the requests originating from this HTML code
  will bypass the light server listening on the default port 80, and go
  directly to the I<server:port> of the full URI.
  
  =head1 Wrapper to Emulate the Server Perl Environment
  
  Often you will start off debugging your script by running it from
  your favorite shell program.  Sometimes you encounter a very weird
  situation when the script runs from the shell but dies when processed
  as a CGI script by a web-server.  The real problem often lies in the
  difference between the environment variables used by your web-server
  and the ones used by your shell program.
  
  For example, you may have a set of non-standard Perl directories
  used for local Perl modules. You have to tell the Perl interpreter
  where these directories are. If you don't want to modify C<@INC> in
  all scripts and modules, you can use the C<PERL5LIB> environment
  variable to tell Perl where the directories are. But then you might
  forget to alter the mod_perl startup script to correct C<@INC> there
  as well. And if you forget this, you can be quite puzzled as to why
  the scripts are running from the shell program, but not from the web.
  
  Of course the I<error_log> will also help you to find out what the
  problem is, but there can be other obscure cases, where you do
  something differently in the shell program and your scripts refuse to
  run under the web-server.
  
  Another example is when you have more than one version of Perl
  installed. You might refer to one version of the Perl executable in
  the script's first line (the shebang line), while the web-server is
  compiled with another Perl version. Since mod_perl ignores the path
  to the Perl executable in the first line of the script, you can get
  quite confused when the code doesn't behave the same when processed
  as a request as it does when executed from the command line. It may
  take a while before you realize that you have been testing the
  scripts from the shell using the I<wrong> Perl version.
  
  The best debugging approach is to write a wrapper that emulates the
  exact environment of the server, first deleting environment variables
  like C<PERL5LIB> and then calling the same perl binary that is used
  by the server.  Next, set the environment to be identical to the
  server's by copying the Perl run directives from the server startup
  and configuration files, or even by I<require()>'ing the startup
  file, provided it doesn't use any C<Apache::> modules, which are
  unavailable under the shell.  This will also allow you to remove the
  first line of the script completely, since mod_perl doesn't need it
  anyway and the wrapper knows how to call the script.
  
  Here is an example of such a script.  Note that we force the use of
  C<-Tw> when we call the real script, since when debugging we want to
  make sure that the code works with taint mode on, and we want to see
  all the warnings so that Perl can help us write better code.

  We have also added the ability to pass parameters, which does not
  happen when you issue a request to the script, but can be helpful at
  times.
  
    #!/usr/bin/perl -w
     
    # This is a wrapper example
     
    # It simulates the web server environment by setting @INC and other
    # stuff, so what will run under this wrapper will run under Web and
    # vice versa. 
    
    #
    # Usage: wrap.pl some_cgi.pl
    #
    BEGIN {
      # we want to make a complete emulation, so we must reset all the
      # paths and add the standard Perl libs
      @INC =
        qw(/usr/lib/perl5/5.00503/i386-linux
           /usr/lib/perl5/5.00503
           /usr/lib/perl5/site_perl/5.005/i386-linux
           /usr/lib/perl5/site_perl/5.005
           .
          );
    }
    
    use strict;
    use File::Basename;
    
      # process the passed params
    my $cgi = shift || '';
    my $params = (@ARGV) ? join(" ", @ARGV) : '';
    
    die "Usage:\n\t$0 some_cgi.pl\n" unless $cgi;
    
      # Set the environment
    my $PERL5LIB = join ":", @INC;
    
      # if the path includes the directory 
      # we extract it and chdir there
    if (index($cgi,'/') >= 0) {
      my $dirname = dirname($cgi);
      chdir $dirname or die "Can't chdir to $dirname: $! \n";
      $cgi =~ m|$dirname/(.*)|;
      $cgi = $1;
    }
    
      # run the cgi from the script's directory
      # Note that we set Warning and Taint modes ON!!!
    system qq{/usr/bin/perl -I$PERL5LIB -Tw $cgi $params};
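
  For example, to check a hypothetical script I<test.pl> with one
  parameter under the wrapper:

    % ./wrap.pl /home/httpd/perl/test.pl debug=1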
  
  =head1 Server Maintenance Chores
  
  It's not enough to have your server and service up and running.  You
  have to maintain the server even when everything seems to be
  fine. This includes security auditing, keeping an eye on the amount
  of remaining unused disk space, available RAM, the load of the
  system, etc.
  
  If you forget about these chores, one day (sooner or later) your
  system will crash, either because it has run out of free disk space,
  or because all the available CPU has been used and the system has
  started to swap heavily, or because someone has broken in.
  Unfortunately the scope of this guide does not cover the latter,
  since it would take more than one book to cover this issue
  profoundly, but the rest of these problems are quite easy to prevent
  if you follow our advice.
  
  Certainly, your particular system might have maintenance chores that
  aren't covered here, but at least you will be alerted that these
  chores are real and should be taken care of.
  
  =head2 Handling Log Files
  
  There are two issues to solve with log files. First, they should be
  rotated and compressed on a regular basis, since over time they tend
  to consume large amounts of disk space. Second, they should be
  monitored for possible sudden explosive growth, which can happen when
  something goes astray in your code running on the mod_perl server and
  the process starts to log thousands of error messages per second
  without stopping, until all the disk space is used and the server can
  no longer work.
  
  =head3 Log Rotation
  
  The first issue is solved by having a process run from the crontab
  at certain times (usually off hours, if this term is still valid in
  the Internet era) to rotate the logs. The log rotation consists of
  renaming the current log file, restarting the server (which creates
  a fresh new log file), and compressing the renamed file and/or
  moving it to a different disk.
  
  For example if we want to rotate the I<access_log> file we could do:
  
    % mv access_log access_log.renamed
    % apachectl restart
    % sleep 5; # allow all children to complete requests and logging
               # now it's safe to use access_log.renamed
    % mv access_log.renamed /some/directory/on/another/disk
  
  This is the script that we run from the crontab to rotate the log
  files:
  
    #!/usr/local/bin/perl -Tw
    
    # This script does log rotation. Called from crontab.
    
    use strict;
    $ENV{PATH}='/bin:/usr/bin';
    
    ### configuration
    my @logfiles = qw(access_log error_log);
    umask 0;
    my $server = "httpd_perl";
    my $logs_dir = "/usr/local/var/$server/logs";
    my $restart_command = "/usr/local/sbin/$server/apachectl restart";
    my $gzip_exec = "/usr/bin/gzip";
    
    my ($sec,$min,$hour,$mday,$mon,$year) = localtime(time);
    my $time = sprintf "%0.4d.%0.2d.%0.2d-%0.2d.%0.2d.%0.2d",
         $year+1900,++$mon,$mday,$hour,$min,$sec;
    $^I = ".$time";
    
    # rename log files
    chdir $logs_dir;
    @ARGV = @logfiles;
    while (<>) {
      close ARGV;
    }
    
    # now restart the server so the logs will be restarted
    system $restart_command;
    
    # allow all children to complete requests and logging
    sleep 5;
  
    # compress log files
    foreach (@logfiles) {
        system "$gzip_exec $_.$time";
    }
  
  Note: Setting C<$^I> sets the in-place edit flag to a dot followed by
  the time.  We copy the names of the logfiles into C<@ARGV>, then open
  each in turn and immediately close it without making any changes; but
  because the in-place edit flag is set, the files are effectively
  renamed.
  
  As you can see, the rotated files will include the date and the time
  in their filenames.
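
  For example, given the C<sprintf> format used in the script above, a
  rotation run on 31 July 2002 at 14:41:49 would rename the access log
  to:

    access_log.2002.07.31-14.41.49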
  
  Here is a more generic set of scripts for log rotation.  A cron job
  fires off a setuid script called I<log-roller> that looks like this:
  
    #!/usr/bin/perl -Tw
    use strict;
    use File::Basename;
    
    $ENV{PATH} = "/usr/ucb:/bin:/usr/bin";
    
    my $ROOT = "/WWW/apache"; # names are relative to this
    my $CONF = "$ROOT/conf/httpd.conf"; # master conf
    my $MIDNIGHT = "MIDNIGHT";  # name of program in each logdir
    
    my ($user_id, $group_id, $pidfile); # will be set during parse of conf
    die "not running as root" if $>;
    
    chdir $ROOT or die "Cannot chdir $ROOT: $!";
    
    my %midnights;
    open CONF, "<$CONF" or die "Cannot open $CONF: $!";
    while (<CONF>) {
      if (/^User (\w+)/i) {
        $user_id = getpwnam($1);
        next;
      }
      if (/^Group (\w+)/i) {
        $group_id = getgrnam($1);
        next;
      }
      if (/^PidFile (.*)/i) {
        $pidfile = $1;
        next;
      }
      next unless /^ErrorLog (.*)/i;
      my $midnight = (dirname $1)."/$MIDNIGHT";
      next unless -x $midnight;
      $midnights{$midnight}++;
    }
    close CONF;
    
    die "missing User definition" unless defined $user_id;
    die "missing Group definition" unless defined $group_id;
    die "missing PidFile definition" unless defined $pidfile;
    
    open PID, $pidfile or die "Cannot open $pidfile: $!";
    <PID> =~ /(\d+)/;
    my $httpd_pid = $1;
    close PID;
    die "missing pid definition" unless defined $httpd_pid and $httpd_pid;
    kill 0, $httpd_pid or die "cannot find pid $httpd_pid: $!";
    
    
    for (sort keys %midnights) {
      defined(my $pid = fork) or die "cannot fork: $!";
      if ($pid) {
        ## parent:
        waitpid $pid, 0;
      } else {
        my $dir = dirname $_;
        ($(,$)) = ($group_id,$group_id);
        ($<,$>) = ($user_id,$user_id);
        chdir $dir or die "cannot chdir $dir: $!";
        exec "./$MIDNIGHT";
        die "cannot exec $MIDNIGHT: $!";
      }
    }
    
    kill 1, $httpd_pid or die "Cannot SIGHUP $httpd_pid: $!";
  
  And then individual C<MIDNIGHT> scripts can look like this:
  
    #!/usr/bin/perl -Tw
    use strict;
    
    die "bad guy" unless getpwuid($<) =~ /^(root|nobody)$/;
    my @LOGFILES = qw(access_log error_log);
    umask 0;
    $^I = ".".time;
    @ARGV = @LOGFILES;
    while (<>) {
      close ARGV;
    }
  
  Can you spot the security holes? Take your time...  This code
  shouldn't be used in hostile situations.
  
  =head3 Non-Scheduled Emergency Log Rotation
  
  As mentioned earlier, there are times when the web server goes wild
  and starts to log lots of messages to the I<error_log> file non-stop.
  If no one monitors this, it is possible that within a few minutes all
  the free disk space will be filled and no process will be able to work
  normally.  When this happens, the I/O caused by the faulty server is
  so heavy that its sibling processes cannot serve requests.
  
  Generally this is not the case, but a few people have reported
  encountering this problem.  If you are one of them, you should run a
  monitoring program that checks the log file size and, if it notices
  that the file has grown too large, attempts to restart the server and
  possibly trim the log file.
  
  When we were using a rather old mod_perl version, we sometimes had
  bursts of the error I<Callback called exit> showing up in our
  I<error_log>.  The file could grow to 300MB in a few minutes.
  
  Below is an example of a script that could be executed from the
  crontab to handle situations like this.  The cron job should run every
  few minutes, or even every minute, since if you experience this
  problem you know that log files fill up very fast.  The example script
  rotates the I<error_log> when its size, as reported by C<ls -s>,
  exceeds 100,000 blocks (roughly 100MB on systems with 1KB blocks).
  Note that this script is useful only in addition to a normal scheduled
  log rotation facility; it is an emergency solver, not a tool for
  routine log rotation.
  
    emergency_rotate.sh
    -------------------
    #!/bin/sh
    S=`ls -s /usr/local/apache/logs/error_log | awk '{print $1}'`
    if [ "$S" -gt 100000 ] ; then
      mv /usr/local/apache/logs/error_log /usr/local/apache/logs/error_log.old
      /etc/rc.d/init.d/httpd restart
      date | /bin/mail -s "error_log $S kB on inx" admin@example.com
    fi
  
  Of course you could write a more advanced script, using timestamps and
  other bells and whistles.  This example merely illustrates how to
  approach the problem in question.
  
  Another solution is to use an out-of-the-box tool written for this
  purpose.  The C<daemontools> package
  (ftp://koobera.math.uic.edu/www/daemontools.html) includes a utility
  called C<multilog>.  This utility saves the stdin stream to one or
  more log files.  It optionally timestamps each line and, for each log,
  includes or excludes lines matching specified patterns.  It
  automatically rotates logs to limit the amount of disk space used.  If
  the disk fills up, it pauses and tries again, without losing any data.
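
  For example, assuming C<daemontools> is installed under
  I</usr/local/bin> and the target log directory already exists, Apache's
  piped-log support can feed the error log straight into C<multilog>
  (the paths and sizes here are assumptions, not recommendations):

    # httpd.conf: timestamp each line (t), rotate at about 1MB
    # (s1000000) and keep at most 5 rotated files (n5)
    ErrorLog "|/usr/local/bin/multilog t s1000000 n5 /usr/local/apache/logs/error"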
  
  The obvious caveat is that it doesn't restart the server, so while it
  solves the log file handling problem it doesn't deal with the
  originator of the problem.  But since the I/O of the faulty Apache
  process will be quite heavy, the rest of the servers will work very
  slowly, if at all, and a normal watchdog should detect this abnormal
  situation and restart the Apache server.
  
  =head1 Swapping Prevention
  
  Before delving into the details of the swapping process, let's refresh
  our knowledge of memory components and memory management.
  
  Computer memory is called RAM, which stands for Random Access Memory.
  Reading and writing to RAM is faster, by a few orders of magnitude,
  than performing the same operations on a hard disk: the former uses
  non-movable memory cells, while the latter uses rotating magnetic
  media.
  
  On most operating systems swap memory is used as an extension of RAM,
  not as a duplication of it.  So if your OS is one of those, and you
  have 128MB of RAM and a 256MB swap partition, you have a total of
  384MB of memory available.  However, you should never count on the
  extra memory when you decide on the maximum number of processes to be
  run, and I will show why in a moment.
  
  Swap memory can be built from a number of hard disk partitions and
  swap files formatted for use as swap memory.  When you need more swap
  memory you can always extend it on demand, as long as you have some
  free disk space (for more information see the I<mkswap> and I<swapon>
  manpages).
  
  System memory is quantified in units called memory pages.  Usually the
  size of a memory page is between 1KB and 8KB.  So if you have 256MB of
  RAM installed on your machine and the page size is 4KB, your system
  has 65,536 main memory pages to work with, and these pages are fast.
  If you also have a 256MB swap partition, the system can use yet
  another 65,536 memory pages, but they are much slower.
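
  Here is the arithmetic behind these numbers, as a tiny Perl sketch:

    my $ram_bytes  = 256 * 1024 * 1024;   # 256MB of RAM
    my $page_bytes =   4 * 1024;          # 4KB page size
    print $ram_bytes / $page_bytes, "\n"; # prints 65536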
  
  When the system is started all memory pages are available for use by
  the programs (processes).
  
  Unless the program is really small, the process running it uses only a
  few segments of the program at a time, each segment mapped onto its
  own memory page.  Therefore only a few memory pages need to be loaded
  into memory at once.
  
  When the process needs an additional program segment to be loaded into
  memory, it asks the system whether the page containing this segment is
  already loaded.  If it is not, an event known as a I<page fault>
  occurs.  The system must then allocate a free memory page, go to the
  disk, and read and load the requested program segment into the
  allocated memory page.
  
  If a process needs to bring a new page into physical memory and there
  are no free physical pages available, the operating system must make
  room for this page by discarding another page from physical memory.
  
  If the page to be discarded from physical memory came from an image or
  data file and has not been written to then the page does not need to
  be saved. Instead it can be discarded and if the process needs that
  page again it can be brought back into memory from the image or data
  file.
  
  However, if the page has been modified, the operating system must
  preserve the contents of that page so that it can be accessed at a
  later time.  This type of page is known as a I<dirty page>, and when
  it is removed from memory it is saved in a special sort of file called
  the swap file.  This process is referred to as I<swapping out>.
  
  Accesses to the swap file are very long relative to the speed of the
  processor and physical memory and the operating system must juggle the
  need to write pages to disk with the need to retain them in memory to
  be used again.
  
  To improve the swapping-out process, i.e. to decrease the chance that
  a page which has just been swapped out will be needed again the next
  moment, the LRU (least recently used) algorithm or something similar
  is used.
  
  To summarize the two swapping scenarios: discarding read-only pages
  incurs no overhead, in contrast to discarding data pages that have
  been written to, since in the latter case the pages have to be written
  to a swap partition located on the slow disk.  Therefore your
  machine's overall performance will be much better if fewer memory
  pages can become dirty.
  
  But here lies the problem with Perl: it does not separate compiled
  code from data.  Both the program code and the program data are seen
  as data pages by the OS, since both are mapped to the same memory
  pages.  Therefore a big chunk of your Perl code becomes dirty when its
  variables are modified, and when those pages need to be discarded they
  have to be written to the swap partition.
  
  This leads us to two important conclusions about swapping and Perl.
  
  =over 
  
  =item *
  
  Running your system when there is no free main memory available
  hinders performance, because process memory pages are repeatedly
  discarded and then re-read from disk.
  
  =item *
  
  Since the majority of the running code is Perl code, in addition to
  the overhead of re-reading the previously discarded pages, there is
  the overhead of saving the dirty pages to the swap partition.
  
  =back
  
  When the system has to swap memory pages in and out, it slows down and
  cannot serve the processes as fast as before.  This leads to an
  accumulation of processes waiting for their turn to run, which further
  increases processing demands, which in turn slows down the system even
  more as more memory is required.  This ever-worsening spiral will
  bring the machine to a halt, unless the resource demand suddenly drops
  and allows the processes to catch up with their tasks and return to
  normal memory usage.
  
  In addition, it's important to know that, for better performance, most
  programs, particularly programs written in Perl, don't return memory
  pages to the OS while they are running.  If some of the memory gets
  freed it is reused when needed by the process, without the additional
  overhead of asking the system to allocate new memory pages.  That's
  why you will observe that Perl programs grow in size as they run and
  almost never shrink.
  
  When the process quits it returns its memory pages to the pool of
  freely available pages for other processes to use.
  
  This scenario is certainly educational, and it should now be obvious
  that a system running a web server should never swap.  It's absolutely
  normal for your desktop to start swapping: you will notice it
  immediately, since things slow down and sometimes the system freezes
  for short periods.  But as I've just mentioned, on a desktop you can
  stop starting new programs and quit some running ones, thus allowing
  the system to catch up with the load and go back to using just the
  RAM.
  
  In the case of a web server you have much less control, since it is
  users who load your machine by issuing requests to your server.
  Therefore you should configure the server so that the maximum number
  of possible processes is small enough, using the C<MaxClients>
  directive (for the technique of choosing the right C<MaxClients> refer
  to the section 'L<Choosing
  MaxClients|guide::performance/Choosing_MaxClients>').  This will
  ensure that at peak hours the system won't swap.  Remember that swap
  space is an emergency pool, not a resource to be used routinely.  If
  you are low on memory and you badly need it, buy it or reduce the
  number of processes to prevent swapping.
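
  As a rough sketch of the arithmetic involved in picking C<MaxClients>
  (the numbers below are assumptions for illustration, not
  recommendations):

    # how many children fit into RAM without swapping?
    my $ram_for_apache = 400 * 1024; # KB dedicated to Apache (assumption)
    my $child_size     =  10 * 1024; # KB per child, as measured (assumption)
    my $max_clients    = int($ram_for_apache / $child_size); # => 40
    print "MaxClients $max_clients\n";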
  
  However, sometimes, due to faulty code, a process might start spinning
  in an unconstrained loop, consuming all the available RAM and starting
  to heavily use swap memory.  In such a situation it helps to have a
  big emergency pool (i.e. lots of swap memory), but you have to resolve
  the problem as soon as possible, since this pool won't last for long.
  In the meantime the C<Apache::Resource> module can be handy; a minimal
  configuration sketch follows.
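
  For example, a minimal I<httpd.conf> sketch (the limit values here are
  assumptions; see the C<Apache::Resource> manpage for the details):

    # limit each child's data segment to 64MB (soft:hard, in MB)
    PerlSetEnv PERL_RLIMIT_DATA 64:64
    PerlModule Apache::Resource
    PerlChildInitHandler Apache::Resource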
  
  For swapping monitoring techniques see the section 'L<Apache::VMonitor
  -- Visual System and Apache Server
  Monitor|guide::debug/Apache__VMonitor____Visual_System_and_Apache_Server_Monitor>'.
  
  =head1 Preventing mod_perl Processes From Going Wild
  
  Sometimes people report that code running under mod_perl has caused
  all the RAM or all the disk to be used.  The following tips should
  help you prevent these problems before they hit you, if at all.
  
  =head2 All RAM Consumed
  
  Sometimes calling an undefined subroutine in a module can cause a
  tight loop that consumes all the available memory.  Here is a way to
  catch such errors.  Define an C<UNIVERSAL::AUTOLOAD> subroutine in
  your I<startup.pl>, or in a E<lt>PerlE<gt>E<lt>/PerlE<gt> section in
  your I<httpd.conf> file:
  
    sub UNIVERSAL::AUTOLOAD {
      my $class = shift;
      warn "$class can't \$UNIVERSAL::AUTOLOAD=$UNIVERSAL::AUTOLOAD!\n";
    }
  
  Either location works; I use the C<E<lt>PerlE<gt>E<lt>/PerlE<gt>>
  section.  Putting it in all your mod_perl modules would be redundant
  (and might give you compile-time errors).
  
  This will produce a nice error in I<error_log>, giving the line number
  of the call and the name of the undefined subroutine.
  
  =head1 Maintainers
  
  Maintainer is the person(s) you should contact with updates,
  corrections and patches.
  
  =over
  
  =item *
  
  Stas Bekman E<lt>stas (at) stason.orgE<gt>
  
  =back
  
  
  =head1 Authors
  
  =over
  
  =item *
  
  Stas Bekman E<lt>stas (at) stason.orgE<gt>
  
  =back
  
  Only the major authors are listed above. For contributors see the
  Changes file.
  
  
  =cut
  
  
  
  
  1.1                  modperl-docs/src/docs/general/correct_headers/correct_headers.pod
  
  Index: correct_headers.pod
  ===================================================================
  =head1 NAME
  
  Issuing Correct HTTP Headers
  
  =head1 Description
  
  To make caching of dynamic documents possible, which can give you a
  considerable performance gain, setting a number of HTTP headers is of
  vital importance.  This document explains which headers you need to
  pay attention to, and how to work with them.
  
  As there is always more than one way to do it, I'm tempted to
  believe one must be the best.  Hardly ever am I right.
  
  =head1 The Origin of this Chapter
  
  This chapter has been contributed to the documentation by Andreas
  Koenig.  You will find the references and other related info at the
  bottom of this page. It was previously distributed from CPAN, but this
  documentation is now its official resting-place.
  
  If you have any questions regarding this specific document only,
  please refer to Andreas, since he is the guru on this subject.  On any
  other matter please contact the L<mod_perl mailing
  list|maillist::modperl>.
  
  
  =head1 Why Headers
  
  Dynamic Content is dynamic, after all, so why would anybody care about
  HTTP headers?  Header composition is a task often neglected in the CGI
  world.  Because pages are generated dynamically, you might expect that
  pages without a C<Last-Modified> header are fine, and that an
  C<If-Modified-Since> header in the browser's request can be ignored.
  This laissez-faire principle gets in the way when you try to establish
  a server that is entirely driven by dynamic components and the number
  of hits is significant.
  
  If the number of hits is not significant, don't bother to read this
  document.
  
  If the number of hits is significant, you might want to consider what
  cache-friendliness means (you may also want to read
  L<[4]|general::correct_headers::correct_headers/_4_>) and how you can cooperate with caches to
  increase the performance of your site.  Especially if you use Squid in
  accelerator mode (helpful hints for Squid, see
  L<[1]|general::correct_headers::correct_headers/_1_>), you will have a strong motivation to
  cooperate with it.  This document may help you to do it correctly.
  
  =head1 Which Headers
  
  The HTTP standard (v 1.1 is specified in L<[3]|general::correct_headers::correct_headers/_3_>, v
  1.0 in L<[2]|general::correct_headers::correct_headers/_2_>) describes lots of headers.  In this
  document, we only discuss those headers which are most relevant to
  caching.
  
  I have grouped the headers into three groups: date headers,
  content headers, and the special Vary header.
  
  =head2 Date Related Headers
  
  =head3 Date
  
  Section 14.18 of the HTTP standard deals with the circumstances under
  which you must or must not send a C<Date> header.  For almost
  everything a normal mod_perl user is doing, a C<Date> header needs to
  be generated.  But the mod_perl programmer doesn't have to worry about
  this header since the Apache server guarantees that this header is
  sent.
  
  In C<http_protocol.c> the C<Date> header is set according to
  C<$r-E<gt>request_time>.  A mod_perl script can read, but not change,
  C<$r-E<gt>request_time>.
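
  For instance, a handler could log the time Apache recorded for the
  current request (a hypothetical debugging aid):

    my $when = scalar localtime $r->request_time;
    warn "request started at $when\n";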
  
  =head3 Last-Modified
  
  Section 14.29 of the HTTP standard deals with this.  The
  C<Last-Modified> header is mostly used as a so-called weak
  validator.  Here are two sentences from the HTTP specs:
  
    A validator that does not always change when the resource
    changes is a "weak validator."
  
    One can think of a strong validator as one that changes
    whenever the bits of an entity changes, while a weak value
    changes whenever the meaning of an entity changes.
  
  This tells us that we should consider the semantics of the page we are
  generating and not the date when we are running.  The question is,
  when did the B<meaning> of this page change last time?  Let's imagine
  the document in question is a text-to-gif renderer that takes as input
  a font to use, background and foreground colours, and a string to
  render.  Although the actual image is created on-the-fly, the
  semantics of the page are determined when the script was last changed,
  right?
  
  Actually, a few more things are relevant: the semantics also change a
  little when you update one of the fonts that may be used or when you
  update your C<ImageMagick> or equivalent program.  It's something you
  should consider, if you want to get it right.
  
  If you have a page which comprises several components, you should ask
  all the components when they last changed their semantic behaviour.
  Then pick the most recent of those times.
  
  mod_perl offers you two convenient methods to deal with this header:
  update_mtime() and set_last_modified().  These methods and several
  others are unavailable in the normal mod_perl environment but are
  silently imported when you use C<Apache::File>.  Refer to the
  C<Apache::File> manpage for more info.
  
  update_mtime() takes a UNIX time as its argument and sets Apache's
  request structure finfo.st_mtime to this value.  It does so only when
  the argument is greater than a previously stored C<finfo.st_mtime>.
  
  set_last_modified() sets the outgoing header C<Last-Modified> to the
  string that corresponds to the stored finfo.st_mtime.  By passing a
  UNIX time to set_last_modified(), mod_perl calls update_mtime() with
  this argument first.
  
    use Apache::File;
    use Date::Parse;
    # Date::Parse parses RCS format, Apache::Util::parsedate doesn't
    $Mtime ||=
      Date::Parse::str2time(substr q$Date: 2002/07/31 14:41:49 $, 6);
    $r->set_last_modified($Mtime);
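
  If the response is built from several files on disk, a sketch like the
  following (with a hypothetical C<@component_files> list) picks the
  most recent modification time automatically, since update_mtime() only
  ever moves the stored time forward:

    use Apache::File ();  # makes update_mtime() and
                          # set_last_modified() available
    for my $file (@component_files) {
        $r->update_mtime((stat $file)[9]); # keep the largest mtime seen
    }
    $r->set_last_modified; # emit Last-Modified from the stored mtime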
  
  =head3 Expires and Cache-Control
  
  Section 14.21 of the HTTP standard deals with the C<Expires>
  header.  The purpose of the C<Expires> header is to determine a point
  in time after which the document should be considered out of date
  (stale).  Don't confuse this with the very different meaning of the
  C<Last-Modified> header.  The C<Expires> header is useful to avoid
  unnecessary validation from now on until the document expires and it
  helps the recipients to clean up their stored documents.  A sentence
  from the HTTP standard:
  
    The presence of an Expires field does not imply that the
    original resource will change or cease to exist at, before, or
    after that time.
  
  So think before you set up a time when you believe a resource should
  be regarded as stale.  Most of the time I can determine an expected
  lifetime from "now", that is the time of the request.  I would not
  recommend hardcoding the date of Expiry, because when you forget that
  you did it, and the date arrives, you will serve "already expired"
  documents that cannot be cached at all by anybody.  If you believe a
  resource will never expire, read this quote from the HTTP specs:
  
    To mark a response as "never expires," an origin server sends an
    Expires date approximately one year from the time the response is
    sent.  HTTP/1.1 servers SHOULD NOT send Expires dates more than one
    year in the future.
  
  Now the code for the mod_perl programmer who wants to expire a
  document half a year from now:
  
    $r->header_out('Expires',
                   HTTP::Date::time2str(time + 180*24*60*60));
  
  A very handy alternative to this computation is available in HTTP 1.1,
  the cache control mechanism. Instead of setting the C<Expires> header
  you can specify a delta value in a C<Cache-Control> header. You can do
  that by executing just:
  
    $r->header_out('Cache-Control', "max-age=" . 180*24*60*60);
  
  which is, of course, much cheaper than the first example, because Perl
  computes the value only once, at compile time, and optimizes it into a
  constant.
  
  As this alternative is only available in HTTP 1.1 and old cache
  servers may not understand this header, it is advisable to send both
  headers.  In this case the C<Cache-Control> header takes precedence, so
  the C<Expires> header is ignored on HTTP 1.1 compliant servers.  Or you
  could go with an if/else clause:
  
    if ($r->protocol =~ /(\d\.\d)/ && $1 >= 1.1){
      $r->header_out('Cache-Control', "max-age=" . 180*24*60*60);
    } else {
      $r->header_out('Expires',
                     HTTP::Date::time2str(time + 180*24*60*60));
    }
  
  If you restart your Apache server regularly, I'd save the C<Expires>
  header in a global variable.  Oh, well, this is probably
  over-engineered now.
  
  To avoid caching altogether call:
  
    $r->no_cache(1);
  
  which sets the headers:
  
    Pragma: no-cache
    Cache-control: no-cache
  
  which should work in major browsers.
  
  Don't set C<Expires> with C<$r-E<gt>header_out> if you use
  C<$r-E<gt>no_cache>, because header_out() takes precedence.  The
  problem that remains is that there are broken browsers which ignore
  C<Expires> headers.
  
  =head2 Content Related Headers
  
  =head3 Content-Type
  
  You are most probably familiar with C<Content-Type>.  Sections 3.7,
  7.2.1 and 14.17 of the HTTP specs cover the details.  mod_perl has the
  C<content_type()> method to deal with this header, for example:
  
    $r->content_type("image/png");
  
  C<Content-Type> I<should> be included in all messages according to the
  specs, and Apache will generate one if you don't.  It will be whatever
  is specified in the relevant C<DefaultType> configuration directive or
  C<text/plain> if none is active.
  
  =head3 Content-Length
  
  According to section 14.13 of the HTTP specifications, the
  C<Content-Length> header is the number of octets in the body of a
  message.  If the length can be determined prior to sending, it is very
  useful to include it, most importantly because keepalive requests only
  work with responses that contain a C<Content-Length> header.  In
  mod_perl you can say
  
    $r->header_out('Content-Length', $length);
  
  If you use C<Apache::File>, you get the additional
  C<set_content_length()> method for the Apache class which is a bit
  more efficient than the above.  You can then say:
  
    $r->set_content_length($length);
  
  The C<Content-Length> header can have an important impact on caches by
  invalidating cache entries as the following extract from the
  specification explains:
  
    The response to a HEAD request MAY be cacheable in the sense that
    the information contained in the response MAY be used to update a
    previously cached entity from that resource.  If the new field values
    indicate that the cached entity differs from the current entity (as
    would be indicated by a change in Content-Length, Content-MD5, ETag
    or Last-Modified), then the cache MUST treat the cache entry as
    stale.
  
  So be careful never to send a wrong C<Content-Length>, either in a
  GET or in a HEAD request.
  
  =head3 Entity Tags
  
  An C<Entity Tag> is a validator which can be used instead of, or in
  addition to, the C<Last-Modified> header.  An entity tag is a quoted
  string which can be used to identify different versions of a
  particular resource.  An entity tag can be added to the response
  headers like so:
  
    $r->header_out("ETag","\"$VERSION\"");
  
  Note: mod_perl offers the C<Apache::set_etag()> method if you have
  loaded C<Apache::File>.  It is strongly recommended that you I<do not>
  use this method unless you know what you are doing.  C<set_etag()> is
  expecting to be used in conjunction with a static request for a file
  on disk that has been stat()ed in the course of the current request.
  It is inappropriate and "dangerous" to use it for dynamic content.
  
  By sending an entity tag you promise the recipient that you will not
  send the same C<ETag> for the same resource again unless the content
  is I<'equal'> to what you are sending now (see below for what equality
  means).
  
  The pros and cons of using entity tags are discussed in section 13.3
  of the HTTP specs. For us mod_perl programmers that discussion can be
  summed up as follows:
  
  There are strong and weak validators.  Strong validators change
  whenever a single bit changes in the response.  Weak validators change
  when the meaning of the response changes.  Strong validators are needed
  for caches to allow for sub-range requests.  Weak validators allow a
  more efficient caching of equivalent objects.  Algorithms like MD5 or
  SHA are good strong validators, but what we usually want, when we want
  to take advantage of caching, is a good weak validator.
  
  A C<Last-Modified> time, when used as a validator in a request, can be
  strong or weak, depending on a couple of rules.  Please refer to
  section 13.3.3 of the HTTP standard to understand these rules.  This
  is mostly relevant for range requests as this citation of section
  14.27 explains:
  
    If the client has no entity tag for an entity, but does have a
    Last-Modified date, it MAY use that date in a If-Range header.
  
  But it is not limited to range requests.  Section 13.3.1 succinctly
  states that:
  
    The Last-Modified entity-header field value is often used as a
    cache validator.
  
  The fact that a C<Last-Modified> date may be used as a strong
  validator can be pretty disturbing if we are in fact changing our
  output slightly without changing the semantics of the output.  To
  prevent these kinds of misunderstanding between us and the cache
  servers in the response chain, we can send a weak validator in an
  C<ETag> header.  This is possible because the specs say:
  
    If a client wishes to perform a sub-range retrieval on a value for
    which it has only a Last-Modified time and no opaque validator, it
    MAY do this only if the Last-Modified time is strong in the sense
    described here.
  
  In other words: by sending them an C<ETag> that is marked as weak we
  prevent them from using the Last-Modified header as a strong
  validator.
  
  An C<ETag> value is marked as a weak validator by prepending the
  string C<W/> to the quoted string; otherwise it is strong.  In Perl
  this would mean something like this:
  
    $r->header_out('ETag',"W/\"$VERSION\"");
  
  Consider carefully which string you choose to act as a validator.  You
  are on your own with this decision because...
  
    ... only the service author knows the semantics of a resource
    well enough to select an appropriate cache validation
    mechanism, and the specification of any validator comparison
    function more complex than byte-equality would open up a can
    of worms.  Thus, comparisons of any other headers (except
    Last-Modified, for compatibility with HTTP/1.0) are never used
    for purposes of validating a cache entry.
  
  If you are composing a message from multiple components, it may be
  necessary to combine some kind of version information for all these
  components into a single string.
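
  A hypothetical sketch: derive one weak C<ETag> from the versions of
  all the components that make up the response (the variable names here
  are assumptions):

    my @versions = ($Foo::VERSION, $Bar::VERSION, $template_mtime);
    my $etag     = join "-", @versions;
    $r->header_out('ETag', qq{W/"$etag"});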
  
  If you are producing relatively large documents, or content that does
  not change frequently, you most likely will prefer a strong entity
  tag, thus giving caches a chance to transfer the document in chunks.
  (Anybody in the mood to add a chapter about ranges to this document?)
  
  =head2 Content Negotiation
  
  Content negotiation is a particularly wonderful feature that was
  introduced with HTTP 1.1.  Unfortunately it is not yet widely
  supported.  Probably the most popular usage scenario of content
  negotiation is language negotiation.  A user specifies in the browser
  preferences the languages they understand and how well they understand
  them.  The browser includes these settings in an C<Accept-Language>
  header when it sends the request to the server and the server then
  chooses from several available representations of the document the one
  that best fits the user's preferences.  Content negotiation is not
  limited to language.  Citing the specs:
  
    HTTP/1.1 includes the following request-header fields for enabling
    server-driven negotiation through description of user agent
    capabilities and user preferences: Accept (section 14.1), Accept-
    Charset (section 14.2), Accept-Encoding (section 14.3), Accept-
    Language (section 14.4), and User-Agent (section 14.43). However, an
    origin server is not limited to these dimensions and MAY vary the
    response based on any aspect of the request, including information
    outside the request-header fields or within extension header fields
    not defined by this specification.
  
  =head3 Vary
  
  In order to signal to the recipient that content negotiation has been
  used to determine the best available representation for a given
  request, the server must include a C<Vary> header.  This tells the
  recipient which request headers have been used to determine it.  So an
  answer may be generated like this:
  
    $r->header_out('Vary', join ", ", 
          qw(accept accept-language accept-encoding user-agent));
  
  The header of a very cool page may greet the user with something like
  
    Hallo Kraut, Dein NutScrape versteht zwar PNG aber leider
    kein GZIP.
  
  (in English: "Hello Kraut, your NutScrape does understand PNG, but
  unfortunately no GZIP")
  
  but it has the side effect of being expensive for a caching proxy.  As
  of this writing, Squid (version 2.1PATCH2) does not cache resources
  that come with a Vary header at all.  So unless you find a clever
  workaround, you won't enjoy your Squid accelerator for these documents
  :-(
  
  =head1 Requests
  
  Section 13.11 of the specifications states that the only two cacheable
  methods are C<GET> and C<HEAD>.
  
  =head2 HEAD
  
  Among the above recommended headers, the date-related ones (C<Date>,
  C<Last-Modified>, and C<Expires>/C<Cache-Control>) are usually easy to
  produce and thus should be computed for C<HEAD> requests just the same
  as for C<GET> requests.
  
  The C<Content-Type> and C<Content-Length> headers should be exactly
  the same as would be supplied to the corresponding C<GET> request.
  But as it can be expensive to compute them, they can just as well be
  omitted, since there is nothing in the specs that forces you to
  compute them.
  
  What is important for the mod_perl programmer is that the response to
  a C<HEAD> request I<must not> contain a message-body.  The code in your
  mod_perl handler might look like this:
  
    # compute the headers that are easy to compute
    if ( $r->header_only ){ # currently equivalent to $r->method eq "HEAD"
      $r->send_http_header;
      return OK;
    }
  
  If you are running a Squid accelerator, it will be able to handle the
  whole C<HEAD> request for you, but under some circumstances it may not
  be allowed to do so.
  
  =head2 POST
  
  The response to a C<POST> request is not cacheable due to an
  underspecification in the HTTP standards.  Section 13.4 does not forbid
  caching of responses to C<POST> requests but no other part of the HTTP
  standard explains how caching of C<POST> requests could be
  implemented, so we are in a vacuum here and all existing caching
  servers therefore refuse to implement caching of C<POST>
  requests.  This may change if somebody does the groundwork of defining
  the semantics for cache operations on C<POST>.  Note that some browsers
  with their more aggressive caching do implement caching of C<POST>
  requests.
  
  Note: If you are running a Squid accelerator, you should be aware that
  it accelerates outgoing traffic, but does not bundle incoming traffic.
  If you have long C<POST> requests, Squid doesn't buy you anything.  So
  always consider using a C<GET> instead of a C<POST> if possible.
  
  =head2 GET
  
  A normal C<GET> is what we usually write our mod_perl programs for.
  Nothing special about it.  We send our headers followed by the body.
  
  But there is a certain case that needs a workaround to achieve better
  cacheability.  We need to deal with the "?" in the rel_path part of
  the requested URI.  Section 13.9 specifies that
  
    ... caches MUST NOT treat responses to such URIs as fresh unless
    the server provides an explicit expiration time.  This specifically
    means that responses from HTTP/1.0 servers for such URIs SHOULD NOT
    be taken from a cache.
  
  You might be tempted to believe that if we are using HTTP 1.1 and send
  an explicit expiration time we are on the safe side.  Unfortunately,
  reality is a little different.  It has been a bad habit for quite a
  long time to misconfigure cache servers such that they treat all
  C<GET> requests containing a question mark as uncacheable.  People
  even used to mark everything as uncacheable that contained the string
  C<cgi-bin>.
  
  To work around this misbehaviour in cache servers, I have stopped
  calling my CGI directories C<cgi-bin> and I have written the following
  handler, which lets me work with CGI-like query strings without
  rewriting the software (such as C<Apache::Request> and C<CGI.pm>) that
  deals with them.
  
    sub handler {
      my($r) = @_;
      my $uri = $r->uri;
      if ( my($u1,$u2) = $uri =~ / ^ ([^?]+?) ; ([^?]*) $ /x ) {
        $r->uri($u1);
        $r->args($u2);
      } elsif ( my($u1,$u2) = $uri =~ m/^(.*?)%3[Bb](.*)$/ ) {
        # protect against old proxies that escape volens nolens
        # (see HTTP standard section 5.1.2)
        $r->uri($u1);
        $u2 =~ s/%3B/;/gi;
        $u2 =~ s/%26/&/gi;
        $u2 =~ s/%3D/=/gi;
        $r->args($u2);
      }
      DECLINED;
    }
  
  This handler must be installed as a C<PerlPostReadRequestHandler>.
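
  For example, assuming the handler above lives in a (hypothetical)
  package called C<My::SemicolonArgs>, the installation would look like:

    PerlModule My::SemicolonArgs
    PerlPostReadRequestHandler My::SemicolonArgs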
  
  The handler takes any request that contains one or more semicolons but
  I<no> question mark such that the first semicolon is interpreted as a
  question mark and everything after that as the query string.  You can
  now exchange the request:
  
    http://example.com/query?BGCOLOR=blue;FGCOLOR=red
  
  with:
  
    http://example.com/query;BGCOLOR=blue;FGCOLOR=red
  
  Thus it allows the co-existence of queries from ordinary forms that
  are being processed by a browser and predefined requests for the same
  resource.  It has one minor bug: Apache doesn't allow percent-escaped
  slashes in such a query string.  So instead of:
  
    http://example.com/query;BGCOLOR=blue;FGCOLOR=red;FONT=%2Ffont%2Fbla
  
  you have to use:
  
    http://example.com/query;BGCOLOR=blue;FGCOLOR=red;FONT=/font/bla
  
  =head2 Conditional GET
  
  A rather challenging request mod_perl programmers can get is the
  conditional C<GET>, which typically means a request with an
  If-Modified-Since header.  The HTTP specifications have this to say:
  
    The semantics of the GET method change to a "conditional GET"
    if the request message includes an If-Modified-Since,
    If-Unmodified-Since, If-Match, If-None-Match, or If-Range
    header field.  A conditional GET method requests that the
    entity be transferred only under the circumstances described
    by the conditional header field(s). The conditional GET method
    is intended to reduce unnecessary network usage by allowing
    cached entities to be refreshed without requiring multiple
    requests or transferring data already held by the client.
  
  So how can we reduce the unnecessary network usage in such a case?
  mod_perl makes it easy for you by offering Apache's
  C<meets_conditions()>.  You have to set up your C<Last-Modified> (and
  possibly C<ETag>) header before calling this method.  If the return
  value of this method is anything other than C<OK>, you should return
  that value from your handler and you're done.  Apache handles the rest
  for you.  The following example is taken from
  L<[5]|general::correct_headers::correct_headers/_5_>:
  
    if((my $rc = $r->meets_conditions) != OK) {
       return $rc;
    }
    #else ... go and send the response body ...
  
  If you have a Squid accelerator running, it will often handle the
  conditionals for you, and you can observe its extremely fast responses
  for such requests in the I<access.log>: just grep for
  C<TCP_IMS_HIT/304>.  But as with a C<HEAD> request, there are
  circumstances under which it may not be allowed to do so.  That is why
  the origin server (which is the server you're programming) needs to
  handle conditional C<GET>s as well even if a Squid accelerator is
  running.
  
  =head1 Avoiding Dealing with Headers
  
  There is another approach to dynamic content that is possible with
  mod_perl.  This approach is appropriate if the content changes
  relatively infrequently, if you expect lots of requests to retrieve
  the same content before it changes again and if it is much cheaper to
  test whether the content needs refreshing than it is to refresh it.
  
  In this case a C<PerlFixupHandler> can be installed for the relevant
  location.  It tests whether the content is up to date.  If so, it
  returns C<DECLINED> and lets the Apache core serve the content from a
  file.  Otherwise, it regenerates the content into the file, updates
  the C<$r-E<gt>finfo> status and again returns C<DECLINED> so that
  Apache serves the updated file.  Updating C<$r-E<gt>finfo> can be
  achieved by calling
  
    $r->filename($file); # force update of finfo
  
  even if this seems redundant because the filename is already equal to
  C<$file>.  Setting the filename has the side effect of doing a
  C<stat()> on the file.  This is important because otherwise Apache
  would use the out of date C<finfo> when generating the response
  header.
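
  A minimal sketch of such a fixup handler follows; content_is_stale()
  and regenerate() stand for application-specific functions and are
  assumptions, not mod_perl API:

    package My::Refresh;
    use Apache::Constants qw(DECLINED);
    
    sub handler {
        my $r    = shift;
        my $file = $r->filename;
        if (content_is_stale($file)) { # cheap up-to-date test
            regenerate($file);         # rebuild the content file
            $r->filename($file);       # force update of finfo
        }
        return DECLINED;               # let the Apache core serve $file
    }
    1;

  Such a handler would be installed with C<PerlFixupHandler My::Refresh>
  for the location in question.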
  
  =head1 References
  
  =head2 [1]
  
  Stas Bekman: L<mod_perl Guide|docs::1.0::guide::index>
  
  =head2 [2]
  
  T. Berners-Lee et al.: Hypertext Transfer Protocol -- HTTP/1.0, RFC
  1945.
  
  =head2 [3]
  
  R. Fielding et al.: Hypertext Transfer Protocol -- HTTP/1.1, RFC 2616.
  
  =head2 [4]
  
  Martin Hamilton: Cachebusting - cause and prevention,
  draft-hamilton-cachebusting-01. Also available online at
  http://vancouver-webpages.com/CacheNow/
  
  =head2 [5]
  
  Lincoln Stein, Doug MacEachern: Writing Apache Modules with Perl and
  C, O'Reilly, 1-56592-567-X. Selected chapters available online at
  http://www.modperl.com/ .
  
  =head1 Other resources
  
  =over
  
  =item *
  
  Prevent the browser from Caching a page
  http://www.pacificnet.net/~johnr/meta.html
  
  This page explains how to use the Meta tag to prevent caching, by
  browser or proxy, of an individual page, for instance a page with data
  of a sensitive nature, such as a "form page for submittal", where the
  page creator wants to make sure that the page does not get submitted
  twice.  Please note that some of the information on this page is a
  little outdated, but it's still a good resource for those who cannot
  generate their own HTTP headers.
  
  =item *
  
  Web Caching and Content Delivery Resources
  http://www.web-caching.com/
  
  =back
  
  =head1 Maintainers
  
  Maintainer is the person(s) you should contact with updates,
  corrections and patches.
  
  =over
  
  =item *
  
  Stas Bekman E<lt>stas (at) stason.orgE<gt>
  
  =back
  
  
  =head1 Authors
  
  =over
  
  =item *
  
  Andreas Koenig E<lt>andreas.koenig (at) anima.deE<gt>
  
  =back
  
  Only the major authors are listed above. For contributors see the
  Changes file.
  
  
  =cut
  
  
  
  
