Posted to cvs@httpd.apache.org by pg...@apache.org on 2007/11/26 17:50:09 UTC

svn commit: r598339 [12/37] - in /httpd/httpd/vendor/pcre/current: ./ doc/ doc/html/ testdata/

Modified: httpd/httpd/vendor/pcre/current/doc/html/pcrebuild.html
URL: http://svn.apache.org/viewvc/httpd/httpd/vendor/pcre/current/doc/html/pcrebuild.html?rev=598339&r1=598338&r2=598339&view=diff
==============================================================================
--- httpd/httpd/vendor/pcre/current/doc/html/pcrebuild.html (original)
+++ httpd/httpd/vendor/pcre/current/doc/html/pcrebuild.html Mon Nov 26 08:49:53 2007
@@ -14,34 +14,56 @@
 <br>
 <ul>
 <li><a name="TOC1" href="#SEC1">PCRE BUILD-TIME OPTIONS</a>
-<li><a name="TOC2" href="#SEC2">UTF-8 SUPPORT</a>
-<li><a name="TOC3" href="#SEC3">UNICODE CHARACTER PROPERTY SUPPORT</a>
-<li><a name="TOC4" href="#SEC4">CODE VALUE OF NEWLINE</a>
-<li><a name="TOC5" href="#SEC5">BUILDING SHARED AND STATIC LIBRARIES</a>
-<li><a name="TOC6" href="#SEC6">POSIX MALLOC USAGE</a>
-<li><a name="TOC7" href="#SEC7">LIMITING PCRE RESOURCE USAGE</a>
-<li><a name="TOC8" href="#SEC8">HANDLING VERY LARGE PATTERNS</a>
-<li><a name="TOC9" href="#SEC9">AVOIDING EXCESSIVE STACK USAGE</a>
-<li><a name="TOC10" href="#SEC10">USING EBCDIC CODE</a>
+<li><a name="TOC2" href="#SEC2">C++ SUPPORT</a>
+<li><a name="TOC3" href="#SEC3">UTF-8 SUPPORT</a>
+<li><a name="TOC4" href="#SEC4">UNICODE CHARACTER PROPERTY SUPPORT</a>
+<li><a name="TOC5" href="#SEC5">CODE VALUE OF NEWLINE</a>
+<li><a name="TOC6" href="#SEC6">WHAT \R MATCHES</a>
+<li><a name="TOC7" href="#SEC7">BUILDING SHARED AND STATIC LIBRARIES</a>
+<li><a name="TOC8" href="#SEC8">POSIX MALLOC USAGE</a>
+<li><a name="TOC9" href="#SEC9">HANDLING VERY LARGE PATTERNS</a>
+<li><a name="TOC10" href="#SEC10">AVOIDING EXCESSIVE STACK USAGE</a>
+<li><a name="TOC11" href="#SEC11">LIMITING PCRE RESOURCE USAGE</a>
+<li><a name="TOC12" href="#SEC12">CREATING CHARACTER TABLES AT BUILD TIME</a>
+<li><a name="TOC13" href="#SEC13">USING EBCDIC CODE</a>
+<li><a name="TOC14" href="#SEC14">SEE ALSO</a>
+<li><a name="TOC15" href="#SEC15">AUTHOR</a>
+<li><a name="TOC16" href="#SEC16">REVISION</a>
 </ul>
 <br><a name="SEC1" href="#TOC1">PCRE BUILD-TIME OPTIONS</a><br>
 <P>
 This document describes the optional features of PCRE that can be selected when
-the library is compiled. They are all selected, or deselected, by providing
-options to the <b>configure</b> script that is run before the <b>make</b>
-command. The complete list of options for <b>configure</b> (which includes the
-standard ones such as the selection of the installation directory) can be
-obtained by running
+the library is compiled. It assumes use of the <b>configure</b> script, where
+the optional features are selected or deselected by providing options to
+<b>configure</b> before running the <b>make</b> command. However, the same
+options can be selected in both Unix-like and non-Unix-like environments using
+the GUI facility of <b>CMakeSetup</b> if you are using <b>CMake</b> instead of
+<b>configure</b> to build PCRE.
+</P>
+<P>
+The complete list of options for <b>configure</b> (which includes the standard
+ones such as the selection of the installation directory) can be obtained by
+running
 <pre>
   ./configure --help
 </pre>
-The following sections describe certain options whose names begin with --enable
-or --disable. These settings specify changes to the defaults for the
+The following sections include descriptions of options whose names begin with
+--enable or --disable. These settings specify changes to the defaults for the
 <b>configure</b> command. Because of the way that <b>configure</b> works,
 --enable and --disable always come in pairs, so the complementary option always
 exists as well, but as it specifies the default, it is not described.
 </P>
-<br><a name="SEC2" href="#TOC1">UTF-8 SUPPORT</a><br>
+<br><a name="SEC2" href="#TOC1">C++ SUPPORT</a><br>
+<P>
+By default, the <b>configure</b> script will search for a C++ compiler and C++
+header files. If it finds them, it automatically builds the C++ wrapper library
+for PCRE. You can disable this by adding
+<pre>
+  --disable-cpp
+</pre>
+to the <b>configure</b> command.
+</P>
+<br><a name="SEC3" href="#TOC1">UTF-8 SUPPORT</a><br>
 <P>
 To build PCRE with support for UTF-8 character strings, add
 <pre>
@@ -52,7 +74,7 @@
 have to set the PCRE_UTF8 option when you call the <b>pcre_compile()</b>
 function.
 </P>
-<br><a name="SEC3" href="#TOC1">UNICODE CHARACTER PROPERTY SUPPORT</a><br>
+<br><a name="SEC4" href="#TOC1">UNICODE CHARACTER PROPERTY SUPPORT</a><br>
 <P>
 UTF-8 support allows PCRE to process character values greater than 255 in the
 strings that it handles. On its own, however, it does not provide any
@@ -66,25 +88,57 @@
 not explicitly requested it.
 </P>
 <P>
-Including Unicode property support adds around 90K of tables to the PCRE
-library, approximately doubling its size. Only the general category properties
-such as <i>Lu</i> and <i>Nd</i> are supported. Details are given in the
+Including Unicode property support adds around 30K of tables to the PCRE
+library. Only the general category properties such as <i>Lu</i> and <i>Nd</i> are
+supported. Details are given in the
 <a href="pcrepattern.html"><b>pcrepattern</b></a>
 documentation.
 </P>
-<br><a name="SEC4" href="#TOC1">CODE VALUE OF NEWLINE</a><br>
+<br><a name="SEC5" href="#TOC1">CODE VALUE OF NEWLINE</a><br>
 <P>
-By default, PCRE treats character 10 (linefeed) as the newline character. This
-is the normal newline character on Unix-like systems. You can compile PCRE to
-use character 13 (carriage return) instead by adding
+By default, PCRE interprets character 10 (linefeed, LF) as indicating the end
+of a line. This is the normal newline character on Unix-like systems. You can
+compile PCRE to use character 13 (carriage return, CR) instead, by adding
 <pre>
   --enable-newline-is-cr
 </pre>
-to the <b>configure</b> command. For completeness there is also a
---enable-newline-is-lf option, which explicitly specifies linefeed as the
-newline character.
+to the <b>configure</b> command. There is also a --enable-newline-is-lf option,
+which explicitly specifies linefeed as the newline character.
+<br>
+<br>
+Alternatively, you can specify that line endings are to be indicated by the two
+character sequence CRLF. If you want this, add
+<pre>
+  --enable-newline-is-crlf
+</pre>
+to the <b>configure</b> command. There is a fourth option, specified by
+<pre>
+  --enable-newline-is-anycrlf
+</pre>
+which causes PCRE to recognize any of the three sequences CR, LF, or CRLF as
+indicating a line ending. Finally, a fifth option, specified by
+<pre>
+  --enable-newline-is-any
+</pre>
+causes PCRE to recognize any Unicode newline sequence.
 </P>
-<br><a name="SEC5" href="#TOC1">BUILDING SHARED AND STATIC LIBRARIES</a><br>
+<P>
+Whatever line ending convention is selected when PCRE is built can be
+overridden when the library functions are called. At build time it is
+conventional to use the standard for your operating system.
+</P>
+<br><a name="SEC6" href="#TOC1">WHAT \R MATCHES</a><br>
+<P>
+By default, the sequence \R in a pattern matches any Unicode newline sequence,
+whatever has been selected as the line ending sequence. If you specify
+<pre>
+  --enable-bsr-anycrlf
+</pre>
+the default is changed so that \R matches only CR, LF, or CRLF. Whatever is
+selected when PCRE is built can be overridden when the library functions are
+called.
+</P>
+<br><a name="SEC7" href="#TOC1">BUILDING SHARED AND STATIC LIBRARIES</a><br>
 <P>
 The PCRE building process uses <b>libtool</b> to build both shared and static
 Unix libraries by default. You can suppress one of these by adding one of
@@ -94,7 +148,7 @@
 </pre>
 to the <b>configure</b> command, as required.
 </P>
-<br><a name="SEC6" href="#TOC1">POSIX MALLOC USAGE</a><br>
+<br><a name="SEC8" href="#TOC1">POSIX MALLOC USAGE</a><br>
 <P>
 When PCRE is called through the POSIX interface (see the
 <a href="pcreposix.html"><b>pcreposix</b></a>
@@ -110,22 +164,7 @@
 </pre>
 to the <b>configure</b> command.
 </P>
-<br><a name="SEC7" href="#TOC1">LIMITING PCRE RESOURCE USAGE</a><br>
-<P>
-Internally, PCRE has a function called <b>match()</b>, which it calls repeatedly
-(possibly recursively) when matching a pattern. By controlling the maximum
-number of times this function may be called during a single matching operation,
-a limit can be placed on the resources used by a single call to
-<b>pcre_exec()</b>. The limit can be changed at run time, as described in the
-<a href="pcreapi.html"><b>pcreapi</b></a>
-documentation. The default is 10 million, but this can be changed by adding a
-setting such as
-<pre>
-  --with-match-limit=500000
-</pre>
-to the <b>configure</b> command.
-</P>
-<br><a name="SEC8" href="#TOC1">HANDLING VERY LARGE PATTERNS</a><br>
+<br><a name="SEC9" href="#TOC1">HANDLING VERY LARGE PATTERNS</a><br>
 <P>
 Within a compiled pattern, offset values are used to point from one part to
 another (for example, from an opening parenthesis to an alternation
@@ -141,46 +180,115 @@
 longer offsets slows down the operation of PCRE because it has to load
 additional bytes when handling them.
 </P>
+<br><a name="SEC10" href="#TOC1">AVOIDING EXCESSIVE STACK USAGE</a><br>
 <P>
-If you build PCRE with an increased link size, test 2 (and test 5 if you are
-using UTF-8) will fail. Part of the output of these tests is a representation
-of the compiled pattern, and this changes with the link size.
-</P>
-<br><a name="SEC9" href="#TOC1">AVOIDING EXCESSIVE STACK USAGE</a><br>
-<P>
-PCRE implements backtracking while matching by making recursive calls to an
-internal function called <b>match()</b>. In environments where the size of the
-stack is limited, this can severely limit PCRE's operation. (The Unix
-environment does not usually suffer from this problem.) An alternative approach
-that uses memory from the heap to remember data, instead of using recursive
-function calls, has been implemented to work round this problem. If you want to
+When matching with the <b>pcre_exec()</b> function, PCRE implements backtracking
+by making recursive calls to an internal function called <b>match()</b>. In
+environments where the size of the stack is limited, this can severely limit
+PCRE's operation. (The Unix environment does not usually suffer from this
+problem, but it may sometimes be necessary to increase the maximum stack size.
+There is a discussion in the
+<a href="pcrestack.html"><b>pcrestack</b></a>
+documentation.) An alternative approach to recursion that uses memory from the
+heap to remember data, instead of using recursive function calls, has been
+implemented to work round the problem of limited stack size. If you want to
 build a version of PCRE that works this way, add
 <pre>
   --disable-stack-for-recursion
 </pre>
 to the <b>configure</b> command. With this configuration, PCRE will use the
 <b>pcre_stack_malloc</b> and <b>pcre_stack_free</b> variables to call memory
-management functions. Separate functions are provided because the usage is very
-predictable: the block sizes requested are always the same, and the blocks are
-always freed in reverse order. A calling program might be able to implement
-optimized functions that perform better than the standard <b>malloc()</b> and
-<b>free()</b> functions. PCRE runs noticeably more slowly when built in this
-way.
+management functions. By default these point to <b>malloc()</b> and
+<b>free()</b>, but you can replace the pointers so that your own functions are
+used.
+</P>
+<P>
+Separate functions are provided rather than using <b>pcre_malloc</b> and
+<b>pcre_free</b> because the usage is very predictable: the block sizes
+requested are always the same, and the blocks are always freed in reverse
+order. A calling program might be able to implement optimized functions that
+perform better than <b>malloc()</b> and <b>free()</b>. PCRE runs noticeably more
+slowly when built in this way. This option affects only the <b>pcre_exec()</b>
+function; it is not relevant for the <b>pcre_dfa_exec()</b> function.
 </P>
-<br><a name="SEC10" href="#TOC1">USING EBCDIC CODE</a><br>
+<br><a name="SEC11" href="#TOC1">LIMITING PCRE RESOURCE USAGE</a><br>
+<P>
+Internally, PCRE has a function called <b>match()</b>, which it calls repeatedly
+(sometimes recursively) when matching a pattern with the <b>pcre_exec()</b>
+function. By controlling the maximum number of times this function may be
+called during a single matching operation, a limit can be placed on the
+resources used by a single call to <b>pcre_exec()</b>. The limit can be changed
+at run time, as described in the
+<a href="pcreapi.html"><b>pcreapi</b></a>
+documentation. The default is 10 million, but this can be changed by adding a
+setting such as
+<pre>
+  --with-match-limit=500000
+</pre>
+to the <b>configure</b> command. This setting has no effect on the
+<b>pcre_dfa_exec()</b> matching function.
+</P>
+<P>
+In some environments it is desirable to limit the depth of recursive calls of
+<b>match()</b> more strictly than the total number of calls, in order to
+restrict the maximum amount of stack (or heap, if --disable-stack-for-recursion
+is specified) that is used. A second limit controls this; it defaults to the
+value that is set for --with-match-limit, which imposes no additional
+constraints. However, you can set a lower limit by adding, for example,
+<pre>
+  --with-match-limit-recursion=10000
+</pre>
+to the <b>configure</b> command. This value can also be overridden at run time.
+</P>
+<br><a name="SEC12" href="#TOC1">CREATING CHARACTER TABLES AT BUILD TIME</a><br>
+<P>
+PCRE uses fixed tables for processing characters whose code values are less
+than 256. By default, PCRE is built with a set of tables that are distributed
+in the file <i>pcre_chartables.c.dist</i>. These tables are for ASCII codes
+only. If you add
+<pre>
+  --enable-rebuild-chartables
+</pre>
+to the <b>configure</b> command, the distributed tables are no longer used.
+Instead, a program called <b>dftables</b> is compiled and run. This outputs the
+source for a new set of tables, created in the default locale of your C runtime
+system. (This method of replacing the tables does not work if you are cross
+compiling, because <b>dftables</b> is run on the local host. If you need to
+create alternative tables when cross compiling, you will have to do so "by
+hand".)
+</P>
+<br><a name="SEC13" href="#TOC1">USING EBCDIC CODE</a><br>
 <P>
 PCRE assumes by default that it will run in an environment where the character
-code is ASCII (or Unicode, which is a superset of ASCII). PCRE can, however, be
-compiled to run in an EBCDIC environment by adding
+code is ASCII (or Unicode, which is a superset of ASCII). This is the case for
+most computer operating systems. PCRE can, however, be compiled to run in an
+EBCDIC environment by adding
 <pre>
   --enable-ebcdic
 </pre>
-to the <b>configure</b> command.
+to the <b>configure</b> command. This setting implies
+--enable-rebuild-chartables. You should only use it if you know that you are in
+an EBCDIC environment (for example, an IBM mainframe operating system).
+</P>
+<br><a name="SEC14" href="#TOC1">SEE ALSO</a><br>
+<P>
+<b>pcreapi</b>(3), <b>pcre_config</b>(3).
 </P>
+<br><a name="SEC15" href="#TOC1">AUTHOR</a><br>
 <P>
-Last updated: 09 September 2004
+Philip Hazel
+<br>
+University Computing Service
+<br>
+Cambridge CB2 3QH, England.
+<br>
+</P>
+<br><a name="SEC16" href="#TOC1">REVISION</a><br>
+<P>
+Last updated: 21 September 2007
+<br>
+Copyright &copy; 1997-2007 University of Cambridge.
 <br>
-Copyright &copy; 1997-2004 University of Cambridge.
 <p>
 Return to the <a href="index.html">PCRE index page</a>.
 </p>
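
Note on the AVOIDING EXCESSIVE STACK USAGE text in pcrebuild.html above: the new paragraph says that blocks requested through pcre_stack_malloc are always the same size and are always freed in reverse order, so a caller might do better than malloc() and free(). The following is only a minimal sketch of that idea, not code from PCRE. The names lifo_stack_malloc, lifo_stack_free and POOL_SLOTS are hypothetical, the pool is not thread-safe, and cached blocks are never returned to the system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical LIFO block cache; names and sizes are illustrative only. */
#define POOL_SLOTS 64

static void  *pool_slot[POOL_SLOTS]; /* cached blocks, reused across matches */
static size_t pool_block_size = 0;   /* size shared by all cached blocks     */
static int    pool_depth = 0;        /* cached blocks currently handed out   */

/* Same signature as malloc(), so it could be assigned to pcre_stack_malloc. */
static void *lifo_stack_malloc(size_t size)
{
    if (pool_block_size == 0)
        pool_block_size = size;      /* the first request fixes the size */

    /* Anything unexpected (odd size, pool exhausted) falls back to malloc(). */
    if (size != pool_block_size || pool_depth >= POOL_SLOTS)
        return malloc(size);

    if (pool_slot[pool_depth] == NULL)
        pool_slot[pool_depth] = malloc(size);
    if (pool_slot[pool_depth] == NULL)
        return NULL;                 /* out of memory */

    return pool_slot[pool_depth++];
}

/* Same signature as free(), so it could be assigned to pcre_stack_free. */
static void lifo_stack_free(void *block)
{
    int i;

    /* Common case: the most recently issued cached block comes back first. */
    if (pool_depth > 0 && block == pool_slot[pool_depth - 1]) {
        pool_depth--;                /* keep the block cached for reuse */
        return;
    }

    /* A cached block freed out of order would break the LIFO assumption. */
    for (i = 0; i < pool_depth; i++)
        assert(block != pool_slot[i]);

    free(block);                     /* block came from the malloc() fallback */
}
```

With a library built using --disable-stack-for-recursion, a program would presumably install these by assigning them to the pcre_stack_malloc and pcre_stack_free variables before calling pcre_exec(). Whether this actually outperforms the system allocator would have to be measured.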

Modified: httpd/httpd/vendor/pcre/current/doc/html/pcrecallout.html
URL: http://svn.apache.org/viewvc/httpd/httpd/vendor/pcre/current/doc/html/pcrecallout.html?rev=598339&r1=598338&r2=598339&view=diff
==============================================================================
--- httpd/httpd/vendor/pcre/current/doc/html/pcrecallout.html (original)
+++ httpd/httpd/vendor/pcre/current/doc/html/pcrecallout.html Mon Nov 26 08:49:53 2007
@@ -17,6 +17,8 @@
 <li><a name="TOC2" href="#SEC2">MISSING CALLOUTS</a>
 <li><a name="TOC3" href="#SEC3">THE CALLOUT INTERFACE</a>
 <li><a name="TOC4" href="#SEC4">RETURN VALUES</a>
+<li><a name="TOC5" href="#SEC5">AUTHOR</a>
+<li><a name="TOC6" href="#SEC6">REVISION</a>
 </ul>
 <br><a name="SEC1" href="#TOC1">PCRE CALLOUTS</a><br>
 <P>
@@ -35,7 +37,7 @@
 a number less than 256 after the letter C. The default value is zero.
 For example, this pattern has two callout points:
 <pre>
-  (?C1)\deabc(?C2)def
+  (?C1)abc(?C2)def
 </pre>
 If the PCRE_AUTO_CALLOUT option bit is set when <b>pcre_compile()</b> is called,
 PCRE automatically inserts callouts, all with number 255, before each item in
@@ -72,9 +74,10 @@
 <br><a name="SEC3" href="#TOC1">THE CALLOUT INTERFACE</a><br>
 <P>
 During matching, when PCRE reaches a callout point, the external function
-defined by <i>pcre_callout</i> is called (if it is set). The only argument is a
-pointer to a <b>pcre_callout</b> block. This structure contains the following
-fields:
+defined by <i>pcre_callout</i> is called (if it is set). This applies to both
+the <b>pcre_exec()</b> and the <b>pcre_dfa_exec()</b> matching functions. The
+only argument to the callout function is a pointer to a <b>pcre_callout</b>
+block. This structure contains the following fields:
 <pre>
   int          <i>version</i>;
   int          <i>callout_number</i>;
@@ -101,40 +104,47 @@
 </P>
 <P>
 The <i>offset_vector</i> field is a pointer to the vector of offsets that was
-passed by the caller to <b>pcre_exec()</b>. The contents can be inspected in
-order to extract substrings that have been matched so far, in the same way as
-for extracting substrings after a match has completed.
+passed by the caller to <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>. When
+<b>pcre_exec()</b> is used, the contents can be inspected in order to extract
+substrings that have been matched so far, in the same way as for extracting
+substrings after a match has completed. For <b>pcre_dfa_exec()</b> this field is
+not useful.
 </P>
 <P>
 The <i>subject</i> and <i>subject_length</i> fields contain copies of the values
 that were passed to <b>pcre_exec()</b>.
 </P>
 <P>
-The <i>start_match</i> field contains the offset within the subject at which the
-current match attempt started. If the pattern is not anchored, the callout
-function may be called several times from the same point in the pattern for
-different starting points in the subject.
+The <i>start_match</i> field normally contains the offset within the subject at
+which the current match attempt started. However, if the escape sequence \K
+has been encountered, this value is changed to reflect the modified starting
+point. If the pattern is not anchored, the callout function may be called
+several times from the same point in the pattern for different starting points
+in the subject.
 </P>
 <P>
 The <i>current_position</i> field contains the offset within the subject of the
 current match pointer.
 </P>
 <P>
-The <i>capture_top</i> field contains one more than the number of the highest
-numbered captured substring so far. If no substrings have been captured,
-the value of <i>capture_top</i> is one.
+When the <b>pcre_exec()</b> function is used, the <i>capture_top</i> field
+contains one more than the number of the highest numbered captured substring so
+far. If no substrings have been captured, the value of <i>capture_top</i> is
+one. This is always the case when <b>pcre_dfa_exec()</b> is used, because it
+does not support captured substrings.
 </P>
 <P>
 The <i>capture_last</i> field contains the number of the most recently captured
-substring. If no substrings have been captured, its value is -1.
+substring. If no substrings have been captured, its value is -1. This is always
+the case when <b>pcre_dfa_exec()</b> is used.
 </P>
 <P>
 The <i>callout_data</i> field contains a value that is passed to
-<b>pcre_exec()</b> by the caller specifically so that it can be passed back in
-callouts. It is passed in the <i>pcre_callout</i> field of the <b>pcre_extra</b>
-data structure. If no such data was passed, the value of <i>callout_data</i> in
-a <b>pcre_callout</b> block is NULL. There is a description of the
-<b>pcre_extra</b> structure in the
+<b>pcre_exec()</b> or <b>pcre_dfa_exec()</b> specifically so that it can be
+passed back in callouts. It is passed in the <i>callout_data</i> field of the
+<b>pcre_extra</b> data structure. If no such data was passed, the value of
+<i>callout_data</i> in a <b>pcre_callout</b> block is NULL. There is a
+description of the <b>pcre_extra</b> structure in the
 <a href="pcreapi.html"><b>pcreapi</b></a>
 documentation.
 </P>
@@ -160,10 +170,10 @@
 <P>
 The external callout function returns an integer to PCRE. If the value is zero,
 matching proceeds as normal. If the value is greater than zero, matching fails
-at the current point, but backtracking to test other matching possibilities
-goes ahead, just as if a lookahead assertion had failed. If the value is less
-than zero, the match is abandoned, and <b>pcre_exec()</b> returns the negative
-value.
+at the current point, but the testing of other matching possibilities goes
+ahead, just as if a lookahead assertion had failed. If the value is less than
+zero, the match is abandoned, and <b>pcre_exec()</b> (or <b>pcre_dfa_exec()</b>)
+returns the negative value.
 </P>
 <P>
 Negative values should normally be chosen from the set of PCRE_ERROR_xxx
@@ -171,10 +181,21 @@
 The error number PCRE_ERROR_CALLOUT is reserved for use by callout functions;
 it will never be used by PCRE itself.
 </P>
+<br><a name="SEC5" href="#TOC1">AUTHOR</a><br>
 <P>
-Last updated: 09 September 2004
+Philip Hazel
+<br>
+University Computing Service
+<br>
+Cambridge CB2 3QH, England.
+<br>
+</P>
+<br><a name="SEC6" href="#TOC1">REVISION</a><br>
+<P>
+Last updated: 29 May 2007
+<br>
+Copyright &copy; 1997-2007 University of Cambridge.
 <br>
-Copyright &copy; 1997-2004 University of Cambridge.
 <p>
 Return to the <a href="index.html">PCRE index page</a>.
 </p>
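
Note on the RETURN VALUES text in pcrecallout.html above: a callout function returns zero to continue, a positive value to fail at the current point while other possibilities are still tried, and a negative value to abandon the match. The sketch below illustrates only that convention. To stay self-contained it uses demo_callout_block, a simplified stand-in for the real pcre_callout block that carries just three of the documented fields, and DEMO_ERROR_CALLOUT, a placeholder for PCRE_ERROR_CALLOUT from pcre.h. The policy itself (a call budget, and a position check on callout 2) is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the pcre_callout block (assumption: only three
   of the documented fields are reproduced here). */
typedef struct {
    int   callout_number;    /* the n in (?Cn)                          */
    int   current_position;  /* offset of the current match pointer     */
    void *callout_data;      /* value supplied by the caller, or NULL   */
} demo_callout_block;

/* Placeholder; real code would use PCRE_ERROR_CALLOUT from pcre.h. */
#define DEMO_ERROR_CALLOUT (-9)

/* Hypothetical policy: callout_data, if set, points to an int budget of
   permitted callouts; callout 2 refuses positions beyond offset 3. */
static int demo_callout(demo_callout_block *cb)
{
    int *budget;

    assert(cb != NULL);
    budget = (int *)cb->callout_data;

    if (budget != NULL) {
        if (*budget <= 0)
            return DEMO_ERROR_CALLOUT;  /* negative: abandon the whole match */
        (*budget)--;
    }

    if (cb->callout_number == 2 && cb->current_position > 3)
        return 1;                       /* positive: fail here, try alternatives */

    return 0;                           /* zero: matching proceeds as normal */
}
```

In a real program the function would take a pointer to the actual callout block type declared in pcre.h and would be installed through the pcre_callout variable described above.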

Modified: httpd/httpd/vendor/pcre/current/doc/html/pcrecompat.html
URL: http://svn.apache.org/viewvc/httpd/httpd/vendor/pcre/current/doc/html/pcrecompat.html?rev=598339&r1=598338&r2=598339&view=diff
==============================================================================
--- httpd/httpd/vendor/pcre/current/doc/html/pcrecompat.html (original)
+++ httpd/httpd/vendor/pcre/current/doc/html/pcrecompat.html Mon Nov 26 08:49:53 2007
@@ -17,12 +17,13 @@
 </b><br>
 <P>
 This document describes the differences in the ways that PCRE and Perl handle
-regular expressions. The differences described here are with respect to Perl
-5.8.
+regular expressions. The differences described here are mainly with respect to
+Perl 5.8, though PCRE versions 7.0 and later contain some features that are
+expected to be in the forthcoming Perl 5.10.
 </P>
 <P>
-1. PCRE does not have full UTF-8 support. Details of what it does have are
-given in the
+1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details of what
+it does have are given in the
 <a href="pcre.html#utf8support">section on UTF-8 support</a>
 in the main
 <a href="pcre.html"><b>pcre</b></a>
@@ -57,7 +58,8 @@
 6. The Perl escape sequences \p, \P, and \X are supported only if PCRE is
 built with Unicode character property support. The properties that can be
 tested with \p and \P are limited to the general category properties such as
-Lu and Nd.
+Lu and Nd, script names such as Greek or Han, and the derived properties Any
+and L&.
 </P>
 <P>
 7. PCRE does support the \Q...\E escape for quoting substrings. Characters in
@@ -75,20 +77,34 @@
 The \Q...\E sequence is recognized both inside and outside character classes.
 </P>
 <P>
-8. Fairly obviously, PCRE does not support the (?{code}) and (?p{code})
-constructions. However, there is support for recursive patterns using the
-non-Perl items (?R), (?number), and (?P&#62;name). Also, the PCRE "callout" feature
-allows an external function to be called during pattern matching. See the
+8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
+constructions. However, there is support for recursive patterns. This is not
+available in Perl 5.8, but will be in Perl 5.10. Also, the PCRE "callout"
+feature allows an external function to be called during pattern matching. See
+the
 <a href="pcrecallout.html"><b>pcrecallout</b></a>
 documentation for details.
 </P>
 <P>
-9. There are some differences that are concerned with the settings of captured
+9. Subpatterns that are called recursively or as "subroutines" are always
+treated as atomic groups in PCRE. This is like Python, but unlike Perl.
+</P>
+<P>
+10. There are some differences that are concerned with the settings of captured
 strings when part of a pattern is repeated. For example, matching "aba" against
 the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE it is set to "b".
 </P>
 <P>
-10. PCRE provides some extensions to the Perl regular expression facilities:
+11. PCRE does support Perl 5.10's backtracking verbs (*ACCEPT), (*FAIL), (*F),
+(*COMMIT), (*PRUNE), (*SKIP), and (*THEN), but only in the forms without an
+argument. PCRE does not support (*MARK). If (*ACCEPT) is within capturing
+parentheses, PCRE does not set that capture group; this is different to Perl.
+</P>
+<P>
+12. PCRE provides some extensions to the Perl regular expression facilities.
+Perl 5.10 will include new features that are not in earlier versions, some of
+which (such as named parentheses) have been in PCRE for some time. This list is
+with respect to Perl 5.10:
 <br>
 <br>
 (a) Although lookbehind assertions must match fixed length strings, each
@@ -101,7 +117,8 @@
 <br>
 <br>
 (c) If PCRE_EXTRA is set, a backslash followed by a letter with no special
-meaning is faulted.
+meaning is faulted. Otherwise, like Perl, the backslash is quietly ignored.
+(Perl can be made to issue a warning.)
 <br>
 <br>
 (d) If PCRE_UNGREEDY is set, the greediness of the repetition quantifiers is
@@ -117,34 +134,46 @@
 options for <b>pcre_exec()</b> have no Perl equivalents.
 <br>
 <br>
-(g) The (?R), (?number), and (?P&#62;name) constructs allows for recursive pattern
-matching (Perl can do this using the (?p{code}) construct, which PCRE cannot
-support.)
+(g) The \R escape sequence can be restricted to match only CR, LF, or CRLF
+by the PCRE_BSR_ANYCRLF option.
+<br>
 <br>
+(h) The callout facility is PCRE-specific.
 <br>
-(h) PCRE supports named capturing substrings, using the Python syntax.
 <br>
+(i) The partial matching facility is PCRE-specific.
 <br>
-(i) PCRE supports the possessive quantifier "++" syntax, taken from Sun's Java
-package.
 <br>
+(j) Patterns compiled by PCRE can be saved and re-used at a later time, even on
+different hosts that have the other endianness.
 <br>
-(j) The (R) condition, for testing recursion, is a PCRE extension.
 <br>
+(k) The alternative matching function (<b>pcre_dfa_exec()</b>) matches in a
+different way and is not Perl-compatible.
 <br>
-(k) The callout facility is PCRE-specific.
 <br>
+(l) PCRE recognizes some special sequences such as (*CR) at the start of
+a pattern that set overall options that cannot be changed within the pattern.
+</P>
+<br><b>
+AUTHOR
+</b><br>
+<P>
+Philip Hazel
 <br>
-(l) The partial matching facility is PCRE-specific.
+University Computing Service
 <br>
+Cambridge CB2 3QH, England.
 <br>
-(m) Patterns compiled by PCRE can be saved and re-used at a later time, even on
-different hosts that have the other endianness.
 </P>
+<br><b>
+REVISION
+</b><br>
 <P>
-Last updated: 09 September 2004
+Last updated: 11 September 2007
+<br>
+Copyright &copy; 1997-2007 University of Cambridge.
 <br>
-Copyright &copy; 1997-2004 University of Cambridge.
 <p>
 Return to the <a href="index.html">PCRE index page</a>.
 </p>

Added: httpd/httpd/vendor/pcre/current/doc/html/pcrecpp.html
URL: http://svn.apache.org/viewvc/httpd/httpd/vendor/pcre/current/doc/html/pcrecpp.html?rev=598339&view=auto
==============================================================================
--- httpd/httpd/vendor/pcre/current/doc/html/pcrecpp.html (added)
+++ httpd/httpd/vendor/pcre/current/doc/html/pcrecpp.html Mon Nov 26 08:49:53 2007
@@ -0,0 +1,364 @@
+<html>
+<head>
+<title>pcrecpp specification</title>
+</head>
+<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
+<h1>pcrecpp man page</h1>
+<p>
+Return to the <a href="index.html">PCRE index page</a>.
+</p>
+<p>
+This page is part of the PCRE HTML documentation. It was generated automatically
+from the original man page. If there is any nonsense in it, please consult the
+man page, in case the conversion went wrong.
+<br>
+<ul>
+<li><a name="TOC1" href="#SEC1">SYNOPSIS OF C++ WRAPPER</a>
+<li><a name="TOC2" href="#SEC2">DESCRIPTION</a>
+<li><a name="TOC3" href="#SEC3">MATCHING INTERFACE</a>
+<li><a name="TOC4" href="#SEC4">QUOTING METACHARACTERS</a>
+<li><a name="TOC5" href="#SEC5">PARTIAL MATCHES</a>
+<li><a name="TOC6" href="#SEC6">UTF-8 AND THE MATCHING INTERFACE</a>
+<li><a name="TOC7" href="#SEC7">PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE</a>
+<li><a name="TOC8" href="#SEC8">SCANNING TEXT INCREMENTALLY</a>
+<li><a name="TOC9" href="#SEC9">PARSING HEX/OCTAL/C-RADIX NUMBERS</a>
+<li><a name="TOC10" href="#SEC10">REPLACING PARTS OF STRINGS</a>
+<li><a name="TOC11" href="#SEC11">AUTHOR</a>
+<li><a name="TOC12" href="#SEC12">REVISION</a>
+</ul>
+<br><a name="SEC1" href="#TOC1">SYNOPSIS OF C++ WRAPPER</a><br>
+<P>
+<b>#include &#60;pcrecpp.h&#62;</b>
+</P>
+<br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br>
+<P>
+The C++ wrapper for PCRE was provided by Google Inc. Some additional
+functionality was added by Giuseppe Maxia. This brief man page was constructed
+from the notes in the <i>pcrecpp.h</i> file, which should be consulted for
+further details.
+</P>
+<br><a name="SEC3" href="#TOC1">MATCHING INTERFACE</a><br>
+<P>
+The "FullMatch" operation checks that supplied text matches a supplied pattern
+exactly. If pointer arguments are supplied, it copies matched sub-strings that
+match sub-patterns into them.
+<pre>
+  Example: successful match
+     pcrecpp::RE re("h.*o");
+     re.FullMatch("hello");
+
+  Example: unsuccessful match (requires full match):
+     pcrecpp::RE re("e");
+     !re.FullMatch("hello");
+
+  Example: creating a temporary RE object:
+     pcrecpp::RE("h.*o").FullMatch("hello");
+</pre>
+You can pass in a "const char*" or a "string" for "text". The examples below
+tend to use a const char*. You can, as in the different examples above, store
+the RE object explicitly in a variable or use a temporary RE object. The
+examples below use one mode or the other arbitrarily. Either could correctly be
+used for any of these examples.
+</P>
+<P>
+You must supply extra pointer arguments to extract matched subpieces.
+<pre>
+  Example: extracts "ruby" into "s" and 1234 into "i"
+     int i;
+     string s;
+     pcrecpp::RE re("(\\w+):(\\d+)");
+     re.FullMatch("ruby:1234", &s, &i);
+
+  Example: does not try to extract any extra sub-patterns
+     re.FullMatch("ruby:1234", &s);
+
+  Example: does not try to extract into NULL
+     re.FullMatch("ruby:1234", NULL, &i);
+
+  Example: integer overflow causes failure
+     !re.FullMatch("ruby:1234567891234", NULL, &i);
+
+  Example: fails because there aren't enough sub-patterns:
+     !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);
+
+  Example: fails because string cannot be stored in integer
+     !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
+</pre>
+The provided pointer arguments can be pointers to any scalar numeric
+type, or one of:
+<pre>
+   string        (matched piece is copied to string)
+   StringPiece   (StringPiece is mutated to point to matched piece)
+   T             (where "bool T::ParseFrom(const char*, int)" exists)
+   NULL          (the corresponding matched sub-pattern is not copied)
+</pre>
+The function returns true iff all of the following conditions are satisfied:
+<pre>
+  a. "text" matches "pattern" exactly;
+
+  b. The number of matched sub-patterns is &#62;= number of supplied
+     pointers;
+
+  c. The "i"th argument has a suitable type for holding the
+     string captured as the "i"th sub-pattern. If you pass in
+     NULL for the "i"th argument, or pass fewer arguments than
+     the number of sub-patterns, the "i"th captured sub-pattern
+     is ignored.
+</pre>
+CAVEAT: An optional sub-pattern that does not exist in the matched
+string is assigned the empty string. Therefore, the following will
+return false (because the empty string is not a valid number):
+<pre>
+   int number;
+   pcrecpp::RE("[a-z]+(\\d+)?").FullMatch("abc", &number);
+</pre>
+The matching interface supports at most 16 arguments per call.
+If you need more, consider using the more general interface
+<b>pcrecpp::RE::DoMatch</b>. See <b>pcrecpp.h</b> for the signature for
+<b>DoMatch</b>.
+</P>
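+<P>
+The three conditions above, and the caveat about optional sub-patterns, can be
+sketched with the C++ standard library. This is only an illustration that uses
+std::regex rather than pcrecpp; the helper names <i>parse_int</i> and
+<i>full_match</i> are hypothetical, and the sketch handles exactly one string
+argument and one int argument.
+</P>

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <regex>
#include <string>

// Hypothetical helper: converts a captured piece to int. It fails on empty
// input, on trailing non-digit text, and on values that do not fit in an int.
static bool parse_int(const std::string& piece, int* out) {
  if (piece.empty()) return false;
  errno = 0;
  char* end = nullptr;
  long v = std::strtol(piece.c_str(), &end, 10);
  if (errno == ERANGE || *end != '\0') return false;
  if (v < INT_MIN || v > INT_MAX) return false;
  if (out != nullptr) *out = static_cast<int>(v);
  return true;
}

// Hypothetical analogue of a two-argument FullMatch built on std::regex.
// A null pointer means "do not extract this sub-pattern".
bool full_match(const std::regex& re, const std::string& text,
                std::string* s, int* i) {
  if (re.mark_count() < 2) return false;             // condition (b)
  std::smatch m;
  if (!std::regex_match(text, m, re)) return false;  // condition (a)
  if (s != nullptr) *s = m[1].str();
  // Condition (c): an unmatched optional group yields "", which is not a
  // valid number, so the call fails as described in the caveat.
  return i == nullptr || parse_int(m[2].str(), i);
}
```

+<P>
+Note that std::regex uses ECMAScript syntax by default, so patterns that rely
+on PCRE-specific features will not behave the same way in this sketch.
+</P>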
+<br><a name="SEC4" href="#TOC1">QUOTING METACHARACTERS</a><br>
+<P>
+You can use the "QuoteMeta" operation to insert backslashes before all
+potentially meaningful characters in a string. The returned string, used as a
+regular expression, will exactly match the original string.
+<pre>
+  Example:
+     string quoted = RE::QuoteMeta(unquoted);
+</pre>
+Note that it's legal to escape a character even if it has no special meaning in
+a regular expression -- so this function does that. (This also makes it
+identical to the Perl function of the same name; see "perldoc -f quotemeta".)
+For example, "1.5-2.0?" becomes "1\.5\-2\.0\?".
+</P>
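+<P>
+As a rough sketch of the quoting rule described above, the following function
+puts a backslash before every ASCII byte that is not a letter, digit, or
+underscore. It is not the pcrecpp implementation; the name
+<i>quote_meta_sketch</i> is hypothetical, and both the treatment of bytes with
+the high bit set and the lack of special handling for NUL bytes are
+assumptions made for the example.
+</P>

```cpp
#include <cctype>
#include <string>

// Hypothetical sketch, not the pcrecpp implementation: escape every ASCII
// byte that is not a letter, digit, or underscore. Bytes with the high bit
// set are copied unchanged (an assumption, so that multi-byte UTF-8
// sequences stay intact).
std::string quote_meta_sketch(const std::string& unquoted) {
  std::string quoted;
  quoted.reserve(unquoted.size() * 2);
  for (unsigned char c : unquoted) {
    const bool plain = std::isalnum(c) || c == '_' || c >= 0x80;
    if (!plain) quoted += '\\';
    quoted += static_cast<char>(c);
  }
  return quoted;
}
```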
+<br><a name="SEC5" href="#TOC1">PARTIAL MATCHES</a><br>
+<P>
+You can use the "PartialMatch" operation when you want the pattern
+to match any substring of the text.
+<pre>
+  Example: simple search for a string:
+     pcrecpp::RE("ell").PartialMatch("hello");
+
+  Example: find first number in a string:
+     int number;
+     pcrecpp::RE re("(\\d+)");
+     re.PartialMatch("x*100 + 20", &number);
+     assert(number == 100);
+</PRE>
+</P>
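+<P>
+The difference between a full match and a partial match can also be seen with
+the C++ standard library, where std::regex_match requires the whole text to
+match and std::regex_search accepts any substring. The sketch below is an
+analogy only and does not use pcrecpp; the function names are hypothetical.
+</P>

```cpp
#include <regex>
#include <string>

// Standard-library analogue of FullMatch: the pattern must cover all of text.
bool matches_fully(const std::string& pattern, const std::string& text) {
  return std::regex_match(text, std::regex(pattern));
}

// Standard-library analogue of PartialMatch: any substring may match.
bool matches_partially(const std::string& pattern, const std::string& text) {
  return std::regex_search(text, std::regex(pattern));
}

// Returns the first run of digits in text as an int, or -1 if there is none.
// std::stoi throws std::out_of_range if the digits do not fit in an int.
int first_number(const std::string& text) {
  std::smatch m;
  if (!std::regex_search(text, m, std::regex("(\\d+)"))) return -1;
  return std::stoi(m[1].str());
}
```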
+<br><a name="SEC6" href="#TOC1">UTF-8 AND THE MATCHING INTERFACE</a><br>
+<P>
+By default, pattern and text are plain text, one byte per character. The UTF8
+flag, passed to the constructor, causes both pattern and string to be treated
+as UTF-8 text, still a byte stream but potentially multiple bytes per
+character. In practice, the text is likelier to be UTF-8 than the pattern, but
+the match returned may depend on the UTF8 flag, so always use it when matching
+UTF8 text. For example, "." will match one byte normally but with UTF8 set may
+match up to three bytes of a multi-byte character.
+<pre>
+  Example:
+     pcrecpp::RE_Options options;
+     options.set_utf8();
+     pcrecpp::RE re(utf8_pattern, options);
+     re.FullMatch(utf8_string);
+
+  Example: using the convenience function UTF8():
+     pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
+     re.FullMatch(utf8_string);
+</pre>
+NOTE: The UTF8 flag is ignored if PCRE was not configured with the
+<b>--enable-utf8</b> flag.
+</P>
+<br><a name="SEC7" href="#TOC1">PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE</a><br>
+<P>
+PCRE defines some modifiers to change the behavior of the regular expression
+engine. The C++ wrapper defines an auxiliary class, RE_Options, as a vehicle to
+pass such modifiers to a RE class. Currently, the following modifiers are
+supported:
+<pre>
+   modifier              description               Perl corresponding
+
+   PCRE_CASELESS         case insensitive match      /i
+   PCRE_MULTILINE        multiple lines match        /m
+   PCRE_DOTALL           dot matches newlines        /s
+   PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
+   PCRE_EXTRA            strict escape parsing       N/A
+   PCRE_EXTENDED         ignore whitespaces          /x
+   PCRE_UTF8             handles UTF8 chars          built-in
+   PCRE_UNGREEDY         reverses * and *?           N/A
+   PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)
+</pre>
+(*) Both Perl and PCRE allow non capturing parentheses by means of the
+"?:" modifier within the pattern itself. e.g. (?:ab|cd) does not
+capture, while (ab|cd) does.
+</P>
+<P>
+For a full account of how each modifier works, please check the
+PCRE API reference page.
+</P>
+<P>
+For each modifier, there are two member functions whose name is made
+out of the modifier in lowercase, without the "PCRE_" prefix. For
+instance, PCRE_CASELESS is handled by
+<pre>
+  bool caseless()
+</pre>
+which returns true if the modifier is set, and
+<pre>
+  RE_Options & set_caseless(bool)
+</pre>
+which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can be
+accessed through the <b>set_match_limit()</b> and <b>match_limit()</b> member
+functions. Setting <i>match_limit</i> to a non-zero value will limit the
+execution of pcre to keep it from doing bad things like blowing the stack or
+taking an eternity to return a result. A value of 5000 is good enough to stop
+stack blowup in a 2MB thread stack. Setting <i>match_limit</i> to zero disables
+match limiting. Alternatively, you can call <b>set_match_limit_recursion()</b>
+which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit how much PCRE
+recurses. <b>match_limit()</b> limits the number of matches PCRE does;
+<b>match_limit_recursion()</b> limits the depth of internal recursion, and
+therefore the amount of stack that is used.
+</P>
+<P>
+Normally, to pass one or more modifiers to a RE class, you declare
+a <i>RE_Options</i> object, set the appropriate options, and pass this
+object to a RE constructor. Example:
+<pre>
+   RE_Options opt;
+   opt.set_caseless(true);
+   if (RE("HELLO", opt).PartialMatch("hello world")) ...
+</pre>
+RE_Options has two constructors. The default constructor takes no arguments and
+creates a set of flags that are all off. The other constructor takes a single
+<i>option_flags</i> argument, which facilitates the transfer of legacy code
+from C programs. This lets you do
+<pre>
+   RE(pattern,
+     RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
+</pre>
+However, new code is better off doing
+<pre>
+   RE(pattern,
+     RE_Options().set_caseless(true).set_multiline(true))
+       .PartialMatch(str);
+</pre>
+If you are going to pass one of the most used modifiers, there are some
+convenience functions that return a RE_Options object with the
+appropriate modifier already set: <b>CASELESS()</b>, <b>UTF8()</b>,
+<b>MULTILINE()</b>, <b>DOTALL()</b>, and <b>EXTENDED()</b>.
+</P>
+<P>
+If you need to set several options at once, and you don't want to declare a
+RE_Options object and set each option in a separate statement, you can do it
+on the fly. You can chain several <b>set_xxxxx()</b> member functions, since
+each of them returns a reference to its class object. For example, to pass
+PCRE_CASELESS, PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one statement,
+you may write:
+<pre>
+   RE(" ^ xyz \\s+ .* blah$",
+     RE_Options()
+       .set_caseless(true)
+       .set_extended(true)
+       .set_multiline(true)).PartialMatch(sometext);
+
+</PRE>
+</P>
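+<P>
+Chaining works because each setter returns a reference to the object it was
+called on. The class below is a minimal, hypothetical stand-in that shows the
+idiom; its name and flag values are made up for the example and are not taken
+from <i>pcrecpp.h</i> or <i>pcre.h</i>.
+</P>

```cpp
// Hypothetical stand-in for an options class. The flag values are
// illustrative only.
class OptionsSketch {
 public:
  enum Flag : unsigned { kCaseless = 1u, kMultiline = 2u, kExtended = 4u };

  // Getters: report whether a modifier is currently set.
  bool caseless() const { return (flags_ & kCaseless) != 0; }
  bool multiline() const { return (flags_ & kMultiline) != 0; }
  bool extended() const { return (flags_ & kExtended) != 0; }

  // Setters: each returns *this, which is what makes chaining possible.
  OptionsSketch& set_caseless(bool on) { return set(kCaseless, on); }
  OptionsSketch& set_multiline(bool on) { return set(kMultiline, on); }
  OptionsSketch& set_extended(bool on) { return set(kExtended, on); }

 private:
  OptionsSketch& set(Flag f, bool on) {
    if (on) {
      flags_ |= f;
    } else {
      flags_ &= ~static_cast<unsigned>(f);
    }
    return *this;
  }

  unsigned flags_ = 0;  // all modifiers off by default
};
```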
+<br><a name="SEC8" href="#TOC1">SCANNING TEXT INCREMENTALLY</a><br>
+<P>
+The "Consume" operation may be useful if you want to repeatedly
+match regular expressions at the front of a string and skip over
+them as they match. This requires use of the "StringPiece" type,
+which represents a sub-range of a real string. Like RE, StringPiece
+is defined in the pcrecpp namespace.
+<pre>
+  Example: read lines of the form "var = value" from a string.
+     string contents = ...;                 // Fill string somehow
+     pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece
+
+     string var;
+     int value;
+     pcrecpp::RE re("(\\w+) = (\\d+)\n");
+     while (re.Consume(&input, &var, &value)) {
+       ...;
+     }
+</pre>
+Each successful call to "Consume" will set "var/value", and also
+advance "input" so it points past the matched text.
+</P>
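+<P>
+The anchored, advancing behaviour described above can be imitated with the
+C++ standard library by searching with the match_continuous flag and moving
+an iterator past each match. This is a sketch of the idea only, not of how
+pcrecpp implements "Consume"; the function name <i>consume_assignments</i> is
+hypothetical.
+</P>

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Parses lines of the form "var = value" from the front of contents and
// stops at the first position where the pattern no longer matches.
// std::stoi throws std::out_of_range if a value does not fit in an int.
std::vector<std::pair<std::string, int>> consume_assignments(
    const std::string& contents) {
  static const std::regex re("(\\w+) = (\\d+)\n");
  std::vector<std::pair<std::string, int>> out;
  std::string::const_iterator pos = contents.begin();  // role of "input"
  std::smatch m;
  // match_continuous anchors each attempt at pos.
  while (std::regex_search(pos, contents.end(), m, re,
                           std::regex_constants::match_continuous)) {
    out.emplace_back(m[1].str(), std::stoi(m[2].str()));
    pos = m[0].second;  // advance past the matched text
  }
  return out;
}
```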
+<P>
+The "FindAndConsume" operation is similar to "Consume" but does not
+anchor your match at the beginning of the string. For example, you
+could extract all words from a string by repeatedly calling
+<pre>
+  pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)
+</PRE>
+</P>
+<br><a name="SEC9" href="#TOC1">PARSING HEX/OCTAL/C-RADIX NUMBERS</a><br>
+<P>
+By default, if you pass a pointer to a numeric value, the
+corresponding text is interpreted as a base-10 number. You can
+instead wrap the pointer with a call to one of the operators Hex(),
+Octal(), or CRadix() to interpret the text in another base. The
+CRadix operator interprets C-style "0" (base-8) and "0x" (base-16)
+prefixes, but defaults to base-10.
+<pre>
+  Example:
+    int a, b, c, d;
+    pcrecpp::RE re("(.*) (.*) (.*) (.*)");
+    re.FullMatch("100 40 0100 0x40",
+                 pcrecpp::Octal(&a), pcrecpp::Hex(&b),
+                 pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
+</pre>
+will leave 64 in a, b, c, and d.
+</P>
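+<P>
+The arithmetic in the example above can be checked with the standard C
+function strtol, which accepts an explicit base and treats base 0 as "use the
+C prefix rules". This is a sketch for verifying the numbers; it does not
+claim that pcrecpp uses strtol, and the wrapper name <i>parse_in_base</i> is
+hypothetical.
+</P>

```cpp
#include <cstdlib>
#include <string>

// Parses text in the given base. Base 0 lets strtol pick octal for a "0"
// prefix, hexadecimal for a "0x" prefix, and decimal otherwise.
long parse_in_base(const std::string& text, int base) {
  return std::strtol(text.c_str(), nullptr, base);
}
```

+<P>
+With this helper, "100" in base 8, "40" in base 16, and both "0100" and "0x40"
+in base 0 all evaluate to 64.
+</P>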
+<br><a name="SEC10" href="#TOC1">REPLACING PARTS OF STRINGS</a><br>
+<P>
+You can replace the first match of "pattern" in "str" with "rewrite".
+Within "rewrite", backslash-escaped digits (\1 to \9) can be
+used to insert text matching corresponding parenthesized group
+from the pattern. \0 in "rewrite" refers to the entire matching
+text. For example:
+<pre>
+  string s = "yabba dabba doo";
+  pcrecpp::RE("b+").Replace("d", &s);
+</pre>
+will leave "s" containing "yada dabba doo". The result is true if the pattern
+matches and a replacement occurs, false otherwise.
+</P>
+<P>
+<b>GlobalReplace</b> is like <b>Replace</b> except that it replaces all
+occurrences of the pattern in the string with the rewrite. Replacements are
+not subject to re-matching. For example:
+<pre>
+  string s = "yabba dabba doo";
+  pcrecpp::RE("b+").GlobalReplace("d", &s);
+</pre>
+will leave "s" containing "yada dada doo". It returns the number of
+replacements made.
+</P>
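+<P>
+For comparison, the same two results can be reproduced with
+std::regex_replace from the C++ standard library, which replaces every match
+by default and only the first when format_first_only is given. This is an
+analogy and not pcrecpp code; the function names are hypothetical, and
+std::regex_replace refers to groups as $1 rather than \1 in the rewrite
+string.
+</P>

```cpp
#include <regex>
#include <string>

// Replaces only the first match of pattern in s.
std::string replace_first(const std::string& s, const std::string& pattern,
                          const std::string& rewrite) {
  return std::regex_replace(s, std::regex(pattern), rewrite,
                            std::regex_constants::format_first_only);
}

// Replaces every non-overlapping match of pattern in s.
std::string replace_all(const std::string& s, const std::string& pattern,
                        const std::string& rewrite) {
  return std::regex_replace(s, std::regex(pattern), rewrite);
}
```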
+<P>
+<b>Extract</b> is like <b>Replace</b>, except that if the pattern matches,
+"rewrite" is copied into "out" (an additional argument) with substitutions.
+The non-matching portions of "text" are ignored. Returns true iff a match
+occurred and the extraction happened successfully;  if no match occurs, the
+string is left unaffected.
+</P>
+<br><a name="SEC11" href="#TOC1">AUTHOR</a><br>
+<P>
+The C++ wrapper was contributed by Google Inc.
+<br>
+Copyright &copy; 2007 Google Inc.
+<br>
+</P>
+<br><a name="SEC12" href="#TOC1">REVISION</a><br>
+<P>
+Last updated: 06 March 2007
+<br>
+<p>
+Return to the <a href="index.html">PCRE index page</a>.
+</p>

Modified: httpd/httpd/vendor/pcre/current/doc/html/pcregrep.html
URL: http://svn.apache.org/viewvc/httpd/httpd/vendor/pcre/current/doc/html/pcregrep.html?rev=598339&r1=598338&r2=598339&view=diff
==============================================================================
--- httpd/httpd/vendor/pcre/current/doc/html/pcregrep.html (original)
+++ httpd/httpd/vendor/pcre/current/doc/html/pcregrep.html Mon Nov 26 08:49:53 2007
@@ -16,143 +16,426 @@
 <li><a name="TOC1" href="#SEC1">SYNOPSIS</a>
 <li><a name="TOC2" href="#SEC2">DESCRIPTION</a>
 <li><a name="TOC3" href="#SEC3">OPTIONS</a>
-<li><a name="TOC4" href="#SEC4">LONG OPTIONS</a>
-<li><a name="TOC5" href="#SEC5">DIAGNOSTICS</a>
-<li><a name="TOC6" href="#SEC6">AUTHOR</a>
+<li><a name="TOC4" href="#SEC4">ENVIRONMENT VARIABLES</a>
+<li><a name="TOC5" href="#SEC5">NEWLINES</a>
+<li><a name="TOC6" href="#SEC6">OPTIONS COMPATIBILITY</a>
+<li><a name="TOC7" href="#SEC7">OPTIONS WITH DATA</a>
+<li><a name="TOC8" href="#SEC8">MATCHING ERRORS</a>
+<li><a name="TOC9" href="#SEC9">DIAGNOSTICS</a>
+<li><a name="TOC10" href="#SEC10">SEE ALSO</a>
+<li><a name="TOC11" href="#SEC11">AUTHOR</a>
+<li><a name="TOC12" href="#SEC12">REVISION</a>
 </ul>
 <br><a name="SEC1" href="#TOC1">SYNOPSIS</a><br>
 <P>
-<b>pcregrep [-Vcfhilnrsuvx] [long options] [pattern] [file1 file2 ...]</b>
+<b>pcregrep [options] [long options] [pattern] [path1 path2 ...]</b>
 </P>
 <br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br>
 <P>
 <b>pcregrep</b> searches files for character patterns, in the same way as other
 grep commands do, but it uses the PCRE regular expression library to support
 patterns that are compatible with the regular expressions of Perl 5. See
-<a href="pcrepattern.html"><b>pcrepattern</b></a>
-for a full description of syntax and semantics of the regular expressions that
-PCRE supports.
+<a href="pcrepattern.html"><b>pcrepattern</b>(3)</a>
+for a full description of syntax and semantics of the regular expressions
+that PCRE supports.
 </P>
 <P>
-A pattern must be specified on the command line unless the <b>-f</b> option is
-used (see below).
+Patterns, whether supplied on the command line or in a separate file, are given
+without delimiters. For example:
+<pre>
+  pcregrep Thursday /etc/motd
+</pre>
+If you attempt to use delimiters (for example, by surrounding a pattern with
+slashes, as is common in Perl scripts), they are interpreted as part of the
+pattern. Quotes can of course be used on the command line because they are
+interpreted by the shell, and indeed they are required if a pattern contains
+white space or shell metacharacters.
+</P>
+<P>
+The first argument that follows any option settings is treated as the single
+pattern to be matched when neither <b>-e</b> nor <b>-f</b> is present.
+Conversely, when one or both of these options are used to specify patterns, all
+arguments are treated as path names. At least one of <b>-e</b>, <b>-f</b>, or an
+argument pattern must be provided.
+</P>
+<P>
+If no files are specified, <b>pcregrep</b> reads the standard input. The
+standard input can also be referenced by a name consisting of a single hyphen.
+For example:
+<pre>
+  pcregrep some-pattern /file1 - /file3
+</pre>
+By default, each line that matches the pattern is copied to the standard
+output, and if there is more than one file, the file name is output at the
+start of each line. However, there are options that can change how
+<b>pcregrep</b> behaves. In particular, the <b>-M</b> option makes it possible to
+search for patterns that span line boundaries. What defines a line boundary is
+controlled by the <b>-N</b> (<b>--newline</b>) option.
 </P>
 <P>
-If no files are specified, <b>pcregrep</b> reads the standard input. By default,
-each line that matches the pattern is copied to the standard output, and if
-there is more than one file, the file name is printed before each line of
-output. However, there are options that can change how <b>pcregrep</b> behaves.
+Patterns are limited to 8K or BUFSIZ characters, whichever is the greater.
+BUFSIZ is defined in <b>&#60;stdio.h&#62;</b>.
 </P>
 <P>
-Lines are limited to BUFSIZ characters. BUFSIZ is defined in <b>&#60;stdio.h&#62;</b>.
-The newline character is removed from the end of each line before it is matched
-against the pattern.
+If the <b>LC_ALL</b> or <b>LC_CTYPE</b> environment variable is set,
+<b>pcregrep</b> uses the value to set a locale when calling the PCRE library.
+The <b>--locale</b> option can be used to override this.
 </P>
 <br><a name="SEC3" href="#TOC1">OPTIONS</a><br>
 <P>
-<b>-V</b>
-Write the version number of the PCRE library being used to the standard error
-stream.
+<b>--</b>
+This terminates the list of options. It is useful if the next item on the
+command line starts with a hyphen but is not an option. This allows for the
+processing of patterns and filenames that start with hyphens.
+</P>
+<P>
+<b>-A</b> <i>number</i>, <b>--after-context=</b><i>number</i>
+Output <i>number</i> lines of context after each matching line. If filenames
+and/or line numbers are being output, a hyphen separator is used instead of a
+colon for the context lines. A line containing "--" is output between each
+group of lines, unless they are in fact contiguous in the input file. The value
+of <i>number</i> is expected to be relatively small. However, <b>pcregrep</b>
+guarantees to have up to 8K of following text available for context output.
+</P>
+<P>
+<b>-B</b> <i>number</i>, <b>--before-context=</b><i>number</i>
+Output <i>number</i> lines of context before each matching line. If filenames
+and/or line numbers are being output, a hyphen separator is used instead of a
+colon for the context lines. A line containing "--" is output between each
+group of lines, unless they are in fact contiguous in the input file. The value
+of <i>number</i> is expected to be relatively small. However, <b>pcregrep</b>
+guarantees to have up to 8K of preceding text available for context output.
+</P>
+<P>
+<b>-C</b> <i>number</i>, <b>--context=</b><i>number</i>
+Output <i>number</i> lines of context both before and after each matching line.
+This is equivalent to setting both <b>-A</b> and <b>-B</b> to the same value.
+</P>
+<P>
+<b>-c</b>, <b>--count</b>
+Do not output individual lines; instead just output a count of the number of
+lines that would otherwise have been output. If several files are given, a
+count is output for each of them. In this mode, the <b>-A</b>, <b>-B</b>, and
+<b>-C</b> options are ignored.
+</P>
+<P>
+<b>--colour</b>, <b>--color</b>
+If this option is given without any data, it is equivalent to "--colour=auto".
+If data is required, it must be given in the same shell item, separated by an
+equals sign.
+</P>
+<P>
+<b>--colour=</b><i>value</i>, <b>--color=</b><i>value</i>
+This option specifies under what circumstances the part of a line that matched
+a pattern should be coloured in the output. The value may be "never" (the
+default), "always", or "auto". In the latter case, colouring happens only if
+the standard output is connected to a terminal. The colour can be specified by
+setting the environment variable PCREGREP_COLOUR or PCREGREP_COLOR. The value
+of this variable should be a string of two numbers, separated by a semicolon.
+They are copied directly into the control string for setting colour on a
+terminal, so it is your responsibility to ensure that they make sense. If
+neither of the environment variables is set, the default is "1;31", which gives
+red.
+</P>
+<P>
+<b>-D</b> <i>action</i>, <b>--devices=</b><i>action</i>
+If an input path is not a regular file or a directory, "action" specifies how
+it is to be processed. Valid values are "read" (the default) or "skip"
+(silently skip the path).
+</P>
+<P>
+<b>-d</b> <i>action</i>, <b>--directories=</b><i>action</i>
+If an input path is a directory, "action" specifies how it is to be processed.
+Valid values are "read" (the default), "recurse" (equivalent to the <b>-r</b>
+option), or "skip" (silently skip the path). In the default case, directories
+are read as if they were ordinary files. In some operating systems the effect
+of reading a directory like this is an immediate end-of-file.
+</P>
+<P>
+<b>-e</b> <i>pattern</i>, <b>--regex=</b><i>pattern</i>,
+<b>--regexp=</b><i>pattern</i> Specify a pattern to be matched. This option can
+be used multiple times in order to specify several patterns. It can also be
+used as a way of specifying a single pattern that starts with a hyphen. When
+<b>-e</b> is used, no argument pattern is taken from the command line; all
+arguments are treated as file names. There is an overall maximum of 100
+patterns. They are applied to each line in the order in which they are defined
+until one matches (or fails to match if <b>-v</b> is used). If <b>-f</b> is used
+with <b>-e</b>, the command line patterns are matched first, followed by the
+patterns from the file, independent of the order in which these options are
+specified. Note that multiple use of <b>-e</b> is not the same as a single
+pattern with alternatives. For example, X|Y finds the first character in a line
+that is X or Y, whereas if the two patterns are given separately,
+<b>pcregrep</b> finds X if it is present, even if it follows Y in the line. It
+finds Y only if there is no X in the line. This really matters only if you are
+using <b>-o</b> to show the portion of the line that matched.
+</P>
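+<P>
+The difference between one pattern with alternatives and several patterns
+tried in order can be illustrated in C++ with the standard library. The sketch
+below is not <b>pcregrep</b> source code; the function name
+<i>first_pattern_match</i> is hypothetical, and std::regex stands in for PCRE.
+</P>

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns the text matched by the first pattern (in list order) that matches
// anywhere in line, or an empty string if no pattern matches.
std::string first_pattern_match(const std::vector<std::string>& patterns,
                                const std::string& line) {
  for (const std::string& p : patterns) {
    std::smatch m;
    if (std::regex_search(line, m, std::regex(p))) return m[0].str();
  }
  return std::string();
}
```

+<P>
+For the line "aYbX", the single pattern X|Y reports "Y" because Y comes first
+in the line, whereas the separate patterns X and then Y report "X" because the
+first pattern is tried over the whole line before the second is considered.
+</P>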
+<P>
+<b>--exclude</b>=<i>pattern</i>
+When <b>pcregrep</b> is searching the files in a directory as a consequence of
+the <b>-r</b> (recursive search) option, any files whose names match the pattern
+are excluded. The pattern is a PCRE regular expression. If a file name matches
+both <b>--include</b> and <b>--exclude</b>, it is excluded. There is no short
+form for this option.
+</P>
+<P>
+<b>-F</b>, <b>--fixed-strings</b>
+Interpret each pattern as a list of fixed strings, separated by newlines,
+instead of as a regular expression. The <b>-w</b> (match as a word) and <b>-x</b>
+(match whole line) options can be used with <b>-F</b>. They apply to each of the
+fixed strings. A line is selected if any of the fixed strings are found in it
+(subject to <b>-w</b> or <b>-x</b>, if present).
+</P>
+<P>
+<b>-f</b> <i>filename</i>, <b>--file=</b><i>filename</i>
+Read a number of patterns from the file, one per line, and match them against
+each line of input. A data line is output if any of the patterns match it. The
+filename can be given as "-" to refer to the standard input. When <b>-f</b> is
+used, patterns specified on the command line using <b>-e</b> may also be
+present; they are tested before the file's patterns. However, no other pattern
+is taken from the command line; all arguments are treated as file names. There
+is an overall maximum of 100 patterns. Trailing white space is removed from
+each line, and blank lines are ignored. An empty file contains no patterns and
+therefore matches nothing.
+</P>
+<P>
+<b>-H</b>, <b>--with-filename</b>
+Force the inclusion of the filename at the start of output lines when searching
+a single file. By default, the filename is not shown in this case. For matching
+lines, the filename is followed by a colon and a space; for context lines, a
+hyphen separator is used. If a line number is also being output, it follows the
+file name without a space.
+</P>
+<P>
+<b>-h</b>, <b>--no-filename</b>
+Suppress the output of filenames when searching multiple files. By default,
+filenames are shown when multiple files are searched. For matching lines, the
+filename is followed by a colon and a space; for context lines, a hyphen
+separator is used. If a line number is also being output, it follows the file
+name without a space.
 </P>
 <P>
-<b>-c</b>
-Do not print individual lines; instead just print a count of the number of
-lines that would otherwise have been printed. If several files are given, a
-count is printed for each of them.
+<b>--help</b>
+Output a brief help message and exit.
 </P>
 <P>
-<b>-f</b><i>filename</i>
-Read a number of patterns from the file, one per line, and match all of them
-against each line of input. A line is output if any of the patterns match it.
-When <b>-f</b> is used, no pattern is taken from the command line; all arguments
-are treated as file names. There is a maximum of 100 patterns. Trailing white
-space is removed, and blank lines are ignored. An empty file contains no
-patterns and therefore matches nothing.
+<b>-i</b>, <b>--ignore-case</b>
+Ignore upper/lower case distinctions during comparisons.
 </P>
 <P>
-<b>-h</b>
-Suppress printing of filenames when searching multiple files.
+<b>--include</b>=<i>pattern</i>
+When <b>pcregrep</b> is searching the files in a directory as a consequence of
+the <b>-r</b> (recursive search) option, only those files whose names match the
+pattern are included. The pattern is a PCRE regular expression. If a file name
+matches both <b>--include</b> and <b>--exclude</b>, it is excluded. There is no
+short form for this option.
+</P>
+<P>
+<b>-L</b>, <b>--files-without-match</b>
+Instead of outputting lines from the files, just output the names of the files
+that do not contain any lines that would have been output. Each file name is
+output once, on a separate line.
+</P>
+<P>
+<b>-l</b>, <b>--files-with-matches</b>
+Instead of outputting lines from the files, just output the names of the files
+containing lines that would have been output. Each file name is output
+once, on a separate line. Searching stops as soon as a matching line is found
+in a file.
+</P>
+<P>
+<b>--label</b>=<i>name</i>
+This option supplies a name to be used for the standard input when file names
+are being output. If not supplied, "(standard input)" is used. There is no
+short form for this option.
+</P>
+<P>
+<b>--locale</b>=<i>locale-name</i>
+This option specifies a locale to be used for pattern matching. It overrides
+the value in the <b>LC_ALL</b> or <b>LC_CTYPE</b> environment variables. If no
+locale is specified, the PCRE library's default (usually the "C" locale) is
+used. There is no short form for this option.
+</P>
+<P>
+<b>-M</b>, <b>--multiline</b>
+Allow patterns to match more than one line. When this option is given, patterns
+may usefully contain literal newline characters and internal occurrences of ^
+and $ characters. The output for any one match may consist of more than one
+line. When this option is set, the PCRE library is called in "multiline" mode.
+There is a limit to the number of lines that can be matched, imposed by the way
+that <b>pcregrep</b> buffers the input file as it scans it. However,
+<b>pcregrep</b> ensures that at least 8K characters or the rest of the document
+(whichever is the shorter) are available for forward matching, and similarly
+the previous 8K characters (or all the previous characters, if fewer than 8K)
+are guaranteed to be available for lookbehind assertions.
+</P>
+<P>
+<b>-N</b> <i>newline-type</i>, <b>--newline=</b><i>newline-type</i>
+The PCRE library supports five different conventions for indicating
+the ends of lines. They are the single-character sequences CR (carriage return)
+and LF (linefeed), the two-character sequence CRLF, an "anycrlf" convention,
+which recognizes any of the preceding three types, and an "any" convention, in
+which any Unicode line ending sequence is assumed to end a line. The Unicode
+sequences are the three just mentioned, plus VT (vertical tab, U+000B), FF
+(formfeed, U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and
+PS (paragraph separator, U+2029).
+<br>
+<br>
+When the PCRE library is built, a default line-ending sequence is specified.
+This is normally the standard sequence for the operating system. Unless
+otherwise specified by this option, <b>pcregrep</b> uses the library's default.
+The possible values for this option are CR, LF, CRLF, ANYCRLF, or ANY. This
+makes it possible to use <b>pcregrep</b> on files that have come from other
+environments without having to modify their line endings. If the data that is
+being scanned does not agree with the convention set by this option,
+<b>pcregrep</b> may behave in strange ways.
 </P>
 <P>
-<b>-i</b>
-Ignore upper/lower case distinctions during comparisons.
+<b>-n</b>, <b>--line-number</b>
+Precede each output line by its line number in the file, followed by a colon
+and a space for matching lines or a hyphen and a space for context lines. If
+the filename is also being output, it precedes the line number.
 </P>
 <P>
-<b>-l</b>
-Instead of printing lines from the files, just print the names of the files
-containing lines that would have been printed. Each file name is printed
-once, on a separate line.
+<b>-o</b>, <b>--only-matching</b>
+Show only the part of the line that matched a pattern. In this mode, no
+context is shown. That is, the <b>-A</b>, <b>-B</b>, and <b>-C</b> options are
+ignored.
 </P>
 <P>
-<b>-n</b>
-Precede each line by its line number in the file.
+<b>-q</b>, <b>--quiet</b>
+Work quietly, that is, display nothing except error messages. The exit
+status indicates whether or not any matches were found.
 </P>
 <P>
-<b>-r</b>
-If any file is a directory, recursively scan the files it contains. Without
-<b>-r</b> a directory is scanned as a normal file.
+<b>-r</b>, <b>--recursive</b>
+If any given path is a directory, recursively scan the files it contains,
+taking note of any <b>--include</b> and <b>--exclude</b> settings. By default, a
+directory is read as a normal file; in some operating systems this gives an
+immediate end-of-file. This option is a shorthand for setting the <b>-d</b>
+option to "recurse".
 </P>
 <P>
-<b>-s</b>
-Work silently, that is, display nothing except error messages.
-The exit status indicates whether any matches were found.
+<b>-s</b>, <b>--no-messages</b>
+Suppress error messages about non-existent or unreadable files. Such files are
+quietly skipped. However, the return code is still 2, even if matches were
+found in other files.
 </P>
 <P>
-<b>-u</b>
+<b>-u</b>, <b>--utf-8</b>
 Operate in UTF-8 mode. This option is available only if PCRE has been compiled
-with UTF-8 support. Both the pattern and each subject line must be valid
-strings of UTF-8 characters.
+with UTF-8 support. Both patterns and subject lines must be valid strings of
+UTF-8 characters.
+</P>
+<P>
+<b>-V</b>, <b>--version</b>
+Write the version numbers of <b>pcregrep</b> and the PCRE library that is being
+used to the standard error stream.
+</P>
+<P>
+<b>-v</b>, <b>--invert-match</b>
+Invert the sense of the match, so that lines which do <i>not</i> match any of
+the patterns are the ones that are found.
 </P>
 <P>
-<b>-v</b>
-Invert the sense of the match, so that lines which do <i>not</i> match the
-pattern are now the ones that are found.
+<b>-w</b>, <b>--word-regex</b>, <b>--word-regexp</b>
+Force the patterns to match only whole words. This is equivalent to having \b
+at the start and end of the pattern.
 </P>
 <P>
-<b>-x</b>
-Force the pattern to be anchored (it must start matching at the beginning of
-the line) and in addition, require it to match the entire line. This is
+<b>-x</b>, <b>--line-regex</b>, <b>--line-regexp</b>
+Force the patterns to be anchored (each must start matching at the beginning of
+a line) and in addition, require them to match entire lines. This is
 equivalent to having ^ and $ characters at the start and end of each
-alternative branch in the regular expression.
+alternative branch in every pattern.
 </P>
-<br><a name="SEC4" href="#TOC1">LONG OPTIONS</a><br>
+<br><a name="SEC4" href="#TOC1">ENVIRONMENT VARIABLES</a><br>
 <P>
-Long forms of all the options are available, as in GNU grep. They are shown in
-the following table:
+The environment variables <b>LC_ALL</b> and <b>LC_CTYPE</b> are examined, in that
+order, for a locale. The first one that is set is used. This can be overridden
+by the <b>--locale</b> option. If no locale is set, the PCRE library's default
+(usually the "C" locale) is used.
+</P>
+<br><a name="SEC5" href="#TOC1">NEWLINES</a><br>
+<P>
+The <b>-N</b> (<b>--newline</b>) option allows <b>pcregrep</b> to scan files with
+different newline conventions from the default. However, the setting of this
+option does not affect the way in which <b>pcregrep</b> writes information to
+the standard error and output streams. It uses the string "\n" in C
+<b>printf()</b> calls to indicate newlines, relying on the C I/O library to
+convert this to an appropriate sequence if the output is sent to a file.
+</P>
+<br><a name="SEC6" href="#TOC1">OPTIONS COMPATIBILITY</a><br>
+<P>
+The majority of short and long forms of <b>pcregrep</b>'s options are the same
+as in the GNU <b>grep</b> program. Any long option of the form
+<b>--xxx-regexp</b> (GNU terminology) is also available as <b>--xxx-regex</b>
+(PCRE terminology). However, the <b>--locale</b>, <b>-M</b>, <b>--multiline</b>,
+<b>-u</b>, and <b>--utf-8</b> options are specific to <b>pcregrep</b>.
+</P>
+<br><a name="SEC7" href="#TOC1">OPTIONS WITH DATA</a><br>
+<P>
+There are four different ways in which an option with data can be specified.
+If a short form option is used, the data may follow immediately, or in the next
+command line item. For example:
 <pre>
-  -c   --count
-  -h   --no-filename
-  -i   --ignore-case
-  -l   --files-with-matches
-  -n   --line-number
-  -r   --recursive
-  -s   --no-messages
-  -u   --utf-8
-  -V   --version
-  -v   --invert-match
-  -x   --line-regex
-  -x   --line-regexp
+  -f/some/file
+  -f /some/file
 </pre>
-In addition, --file=<i>filename</i> is equivalent to -f<i>filename</i>, and
---help shows the list of options and then exits.
+If a long form option is used, the data may appear in the same command line
+item, separated by an equals character, or (with one exception) it may appear
+in the next command line item. For example:
+<pre>
+  --file=/some/file
+  --file /some/file
+</pre>
+Note, however, that if you want to supply a file name beginning with ~ as data
+in a shell command, and have the shell expand ~ to a home directory, you must
+separate the file name from the option, because the shell does not treat ~
+specially unless it is at the start of an item.
+</P>
+<P>
+The exception to the above is the <b>--colour</b> (or <b>--color</b>) option,
+for which the data is optional. If this option does have data, it must be given
+in the first form, using an equals character. Otherwise it will be assumed that
+it has no data.
+</P>
+<br><a name="SEC8" href="#TOC1">MATCHING ERRORS</a><br>
+<P>
+It is possible to supply a regular expression that takes a very long time to
+fail to match certain lines. Such patterns normally involve nested indefinite
+repeats, for example: (a+)*\d when matched against a line of a's with no final
+digit. The PCRE matching function has a resource limit that causes it to abort
+in these circumstances. If this happens, <b>pcregrep</b> outputs an error
+message and the line that caused the problem to the standard error stream. If
+there are more than 20 such errors, <b>pcregrep</b> gives up.
 </P>
-<br><a name="SEC5" href="#TOC1">DIAGNOSTICS</a><br>
+<br><a name="SEC9" href="#TOC1">DIAGNOSTICS</a><br>
 <P>
 Exit status is 0 if any matches were found, 1 if no matches were found, and 2
-for syntax errors or inacessible files (even if matches were found).
+for syntax errors, non-existent or inaccessible files (even if matches were
+found in other files), or too many matching errors. Using the <b>-s</b> option to
+suppress error messages about inaccessible files does not affect the return
+code.
+</P>
+<br><a name="SEC10" href="#TOC1">SEE ALSO</a><br>
+<P>
+<b>pcrepattern</b>(3), <b>pcretest</b>(1).
 </P>
-<br><a name="SEC6" href="#TOC1">AUTHOR</a><br>
+<br><a name="SEC11" href="#TOC1">AUTHOR</a><br>
 <P>
-Philip Hazel &#60;ph10@cam.ac.uk&#62;
+Philip Hazel
 <br>
 University Computing Service
 <br>
-Cambridge CB2 3QG, England.
+Cambridge CB2 3QH, England.
+<br>
 </P>
+<br><a name="SEC12" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 09 September 2004
+Last updated: 16 April 2007
+<br>
+Copyright &copy; 1997-2007 University of Cambridge.
 <br>
-Copyright &copy; 1997-2004 University of Cambridge.
 <p>
 Return to the <a href="index.html">PCRE index page</a>.
 </p>
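
The OPTIONS WITH DATA rules added to pcregrep.html above can be sketched in a few lines of Python. This is only an illustration of the four documented forms and of the --colour exception; it is not pcregrep's actual parser, and the names parse, SHORT_WITH_DATA, LONG_WITH_DATA and OPTIONAL_DATA are hypothetical.

```python
# Hypothetical sketch of the documented rules; not pcregrep's real option parser.
OPTIONAL_DATA = {"--colour", "--color"}      # data is optional, accepted only via "="
SHORT_WITH_DATA = {"-f": "--file"}           # assumed short-to-long mapping for the example
LONG_WITH_DATA = {"--file"} | OPTIONAL_DATA


def parse(argv):
    """Return (options, rest): options is a list of (long_name, data_or_None)."""
    options, rest = [], []
    i = 0
    while i < len(argv):
        item = argv[i]
        name, eq, data = item.partition("=")
        if item.startswith("--") and name in LONG_WITH_DATA:
            if eq:                                # --file=/some/file
                options.append((name, data))
            elif name in OPTIONAL_DATA:           # --colour alone: assume no data
                options.append((name, None))
            else:                                 # --file /some/file
                if i + 1 >= len(argv):
                    raise ValueError("missing data for " + name)
                i += 1
                options.append((name, argv[i]))
        elif item[:2] in SHORT_WITH_DATA:
            long_name = SHORT_WITH_DATA[item[:2]]
            if len(item) > 2:                     # -f/some/file
                options.append((long_name, item[2:]))
            else:                                 # -f /some/file
                if i + 1 >= len(argv):
                    raise ValueError("missing data for " + item)
                i += 1
                options.append((long_name, argv[i]))
        else:
            rest.append(item)                     # pattern, file name, or other option
        i += 1
    return options, rest
```

In this sketch, parse(["--colour", "always"]) leaves "always" among the remaining arguments, which mirrors the documented rule that data for --colour must be attached with an equals character.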

Added: httpd/httpd/vendor/pcre/current/doc/html/pcrematching.html
URL: http://svn.apache.org/viewvc/httpd/httpd/vendor/pcre/current/doc/html/pcrematching.html?rev=598339&view=auto
==============================================================================
--- httpd/httpd/vendor/pcre/current/doc/html/pcrematching.html (added)
+++ httpd/httpd/vendor/pcre/current/doc/html/pcrematching.html Mon Nov 26 08:49:53 2007
@@ -0,0 +1,223 @@
+<html>
+<head>
+<title>pcrematching specification</title>
+</head>
+<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
+<h1>pcrematching man page</h1>
+<p>
+Return to the <a href="index.html">PCRE index page</a>.
+</p>
+<p>
+This page is part of the PCRE HTML documentation. It was generated automatically
+from the original man page. If there is any nonsense in it, please consult the
+man page, in case the conversion went wrong.
+<br>
+<ul>
+<li><a name="TOC1" href="#SEC1">PCRE MATCHING ALGORITHMS</a>
+<li><a name="TOC2" href="#SEC2">REGULAR EXPRESSIONS AS TREES</a>
+<li><a name="TOC3" href="#SEC3">THE STANDARD MATCHING ALGORITHM</a>
+<li><a name="TOC4" href="#SEC4">THE ALTERNATIVE MATCHING ALGORITHM</a>
+<li><a name="TOC5" href="#SEC5">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a>
+<li><a name="TOC6" href="#SEC6">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a>
+<li><a name="TOC7" href="#SEC7">AUTHOR</a>
+<li><a name="TOC8" href="#SEC8">REVISION</a>
+</ul>
+<br><a name="SEC1" href="#TOC1">PCRE MATCHING ALGORITHMS</a><br>
+<P>
+This document describes the two different algorithms that are available in PCRE
+for matching a compiled regular expression against a given subject string. The
+"standard" algorithm is the one provided by the <b>pcre_exec()</b> function.
+This works in the same way as Perl's matching function, and provides a
+Perl-compatible matching operation.
+</P>
+<P>
+An alternative algorithm is provided by the <b>pcre_dfa_exec()</b> function;
+this operates in a different way, and is not Perl-compatible. It has advantages
+and disadvantages compared with the standard algorithm, and these are described
+below.
+</P>
+<P>
+When there is only one possible way in which a given subject string can match a
+pattern, the two algorithms give the same answer. A difference arises, however,
+when there are multiple possibilities. For example, if the pattern
+<pre>
+  ^&#60;.*&#62;
+</pre>
+is matched against the string
+<pre>
+  &#60;something&#62; &#60;something else&#62; &#60;something further&#62;
+</pre>
+there are three possible answers. The standard algorithm finds only one of
+them, whereas the alternative algorithm finds all three.
+</P>
+<br><a name="SEC2" href="#TOC1">REGULAR EXPRESSIONS AS TREES</a><br>
+<P>
+The set of strings that are matched by a regular expression can be represented
+as a tree structure. An unlimited repetition in the pattern makes the tree of
+infinite size, but it is still a tree. Matching the pattern to a given subject
+string (from a given starting point) can be thought of as a search of the tree.
+There are two ways to search a tree: depth-first and breadth-first, and these
+correspond to the two matching algorithms provided by PCRE.
+</P>
+<br><a name="SEC3" href="#TOC1">THE STANDARD MATCHING ALGORITHM</a><br>
+<P>
+In the terminology of Jeffrey Friedl's book "Mastering Regular
+Expressions", the standard algorithm is an "NFA algorithm". It conducts a
+depth-first search of the pattern tree. That is, it proceeds along a single
+path through the tree, checking that the subject matches what is required. When
+there is a mismatch, the algorithm tries any alternatives at the current point,
+and if they all fail, it backs up to the previous branch point in the tree, and
+tries the next alternative branch at that level. This often involves backing up
+(moving to the left) in the subject string as well. The order in which
+repetition branches are tried is controlled by the greedy or ungreedy nature of
+the quantifier.
+</P>
+<P>
+If a leaf node is reached, a matching string has been found, and at that point
+the algorithm stops. Thus, if there is more than one possible match, this
+algorithm returns the first one that it finds. Whether this is the shortest,
+the longest, or some intermediate length depends on the way the greedy and
+ungreedy repetition quantifiers are specified in the pattern.
+</P>
+<P>
+Because it ends up with a single path through the tree, it is relatively
+straightforward for this algorithm to keep track of the substrings that are
+matched by portions of the pattern in parentheses. This provides support for
+capturing parentheses and back references.
+</P>
+<br><a name="SEC4" href="#TOC1">THE ALTERNATIVE MATCHING ALGORITHM</a><br>
+<P>
+This algorithm conducts a breadth-first search of the tree. Starting from the
+first matching point in the subject, it scans the subject string from left to
+right, once, character by character, and as it does this, it remembers all the
+paths through the tree that represent valid matches. In Friedl's terminology,
+this is a kind of "DFA algorithm", though it is not implemented as a
+traditional finite state machine (it keeps multiple states active
+simultaneously).
+</P>
+<P>
+The scan continues until either the end of the subject is reached, or there are
+no more unterminated paths. At this point, terminated paths represent the
+different matching possibilities (if there are none, the match has failed).
+Thus, if there is more than one possible match, this algorithm finds all of
+them, and in particular, it finds the longest. In PCRE, there is an option to
+stop the algorithm after the first match (which is necessarily the shortest)
+has been found.
+</P>
+<P>
+Note that all the matches that are found start at the same point in the
+subject. If the pattern
+<pre>
+  cat(er(pillar)?)?
+</pre>
+is matched against the string "the caterpillar catchment", the result will be
+the three strings "cat", "cater", and "caterpillar" that start at the fifth
+character of the subject. The algorithm does not automatically move on to find
+matches that start at later positions.
+</P>
+<P>
+There are a number of features of PCRE regular expressions that are not
+supported by the alternative matching algorithm. They are as follows:
+</P>
+<P>
+1. Because the algorithm finds all possible matches, the greedy or ungreedy
+nature of repetition quantifiers is not relevant. Greedy and ungreedy
+quantifiers are treated in exactly the same way. However, possessive
+quantifiers can make a difference when what follows could also match what is
+quantified, for example in a pattern like this:
+<pre>
+  ^a++\w!
+</pre>
+This pattern matches "aaab!" but not "aaa!", which would be matched by a
+non-possessive quantifier. Similarly, if an atomic group is present, it is
+matched as if it were a standalone pattern at the current point, and the
+longest match is then "locked in" for the rest of the overall pattern.
+</P>
+<P>
+2. When dealing with multiple paths through the tree simultaneously, it is not
+straightforward to keep track of captured substrings for the different matching
+possibilities, and PCRE's implementation of this algorithm does not attempt to
+do this. This means that no captured substrings are available.
+</P>
+<P>
+3. Because no substrings are captured, back references within the pattern are
+not supported, and cause errors if encountered.
+</P>
+<P>
+4. For the same reason, conditional expressions that use a backreference as the
+condition or test for a specific group recursion are not supported.
+</P>
+<P>
+5. Because many paths through the tree may be active, the \K escape sequence,
+which resets the start of the match when encountered (but may be on some paths
+and not on others), is not supported. It causes an error if encountered.
+</P>
+<P>
+6. Callouts are supported, but the value of the <i>capture_top</i> field is
+always 1, and the value of the <i>capture_last</i> field is always -1.
+</P>
+<P>
+7. The \C escape sequence, which (in the standard algorithm) matches a single
+byte, even in UTF-8 mode, is not supported because the alternative algorithm
+moves through the subject string one character at a time, for all active paths
+through the tree.
+</P>
+<P>
+8. None of the backtracking control verbs such as (*PRUNE) are supported.
+</P>
+<br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
+<P>
+Using the alternative matching algorithm provides the following advantages:
+</P>
+<P>
+1. All possible matches (at a single point in the subject) are automatically
+found, and in particular, the longest match is found. To find more than one
+match using the standard algorithm, you have to do kludgy things with
+callouts.
+</P>
+<P>
+2. There is much better support for partial matching. The restrictions on the
+content of the pattern that apply when using the standard algorithm for partial
+matching do not apply to the alternative algorithm. For non-anchored patterns,
+the starting position of a partial match is available.
+</P>
+<P>
+3. Because the alternative algorithm scans the subject string just once, and
+never needs to backtrack, it is possible to pass very long subject strings to
+the matching function in several pieces, checking for partial matching each
+time.
+</P>
+<br><a name="SEC6" href="#TOC1">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
+<P>
+The alternative algorithm suffers from a number of disadvantages:
+</P>
+<P>
+1. It is substantially slower than the standard algorithm. This is partly
+because it has to search for all possible matches, but is also because it is
+less susceptible to optimization.
+</P>
+<P>
+2. Capturing parentheses and back references are not supported.
+</P>
+<P>
+3. Although atomic groups are supported, their use does not provide the
+performance advantage that it does for the standard algorithm.
+</P>
+<br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
+<P>
+Philip Hazel
+<br>
+University Computing Service
+<br>
+Cambridge CB2 3QH, England.
+<br>
+</P>
+<br><a name="SEC8" href="#TOC1">REVISION</a><br>
+<P>
+Last updated: 08 August 2007
+<br>
+Copyright &copy; 1997-2007 University of Cambridge.
+<br>
+<p>
+Return to the <a href="index.html">PCRE index page</a>.
+</p>
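
The difference between "first match found" and "all matches at one starting point", described in the new pcrematching.html page above, can be illustrated with Python's standard re module. Like pcre_exec(), re is a backtracking matcher, so it reports a single match whose length depends on greediness. The helper all_matches_at is a hypothetical brute-force emulation that tests every end position; it only reproduces the kind of result set the page attributes to pcre_dfa_exec() and does not implement the breadth-first algorithm itself.

```python
import re

SUBJECT = "<something> <something else> <something further>"

# A backtracking matcher stops at the first match it finds; greediness
# decides which of the possible matches that is.
greedy = re.match(r"<.*>", SUBJECT).group(0)    # longest candidate
lazy = re.match(r"<.*?>", SUBJECT).group(0)     # shortest candidate


def all_matches_at(pattern, subject, start=0):
    """List every string that matches `pattern` beginning exactly at `start`.

    Brute force: try each possible end position, longest first. The ordering
    is a choice made for this sketch.
    """
    compiled = re.compile(pattern)
    found = []
    for end in range(len(subject), start - 1, -1):
        if compiled.fullmatch(subject, start, end):
            found.append(subject[start:end])
    return found


angle_matches = all_matches_at(r"<.*>", SUBJECT)
cat_matches = all_matches_at(r"cat(er(pillar)?)?", "the caterpillar catchment", 4)
```

Here cat_matches holds "caterpillar", "cater" and "cat", all starting at offset 4 (the fifth character); nothing is reported for the later "cat" in "catchment", because only one starting point is examined.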

Modified: httpd/httpd/vendor/pcre/current/doc/html/pcrepartial.html
URL: http://svn.apache.org/viewvc/httpd/httpd/vendor/pcre/current/doc/html/pcrepartial.html?rev=598339&r1=598338&r2=598339&view=diff
==============================================================================
--- httpd/httpd/vendor/pcre/current/doc/html/pcrepartial.html (original)
+++ httpd/httpd/vendor/pcre/current/doc/html/pcrepartial.html Mon Nov 26 08:49:53 2007
@@ -16,14 +16,17 @@
 <li><a name="TOC1" href="#SEC1">PARTIAL MATCHING IN PCRE</a>
 <li><a name="TOC2" href="#SEC2">RESTRICTED PATTERNS FOR PCRE_PARTIAL</a>
 <li><a name="TOC3" href="#SEC3">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a>
+<li><a name="TOC4" href="#SEC4">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()</a>
+<li><a name="TOC5" href="#SEC5">AUTHOR</a>
+<li><a name="TOC6" href="#SEC6">REVISION</a>
 </ul>
 <br><a name="SEC1" href="#TOC1">PARTIAL MATCHING IN PCRE</a><br>
 <P>
 In normal use of PCRE, if the subject string that is passed to
-<b>pcre_exec()</b> matches as far as it goes, but is too short to match the
-entire pattern, PCRE_ERROR_NOMATCH is returned. There are circumstances where
-it might be helpful to distinguish this case from other cases in which there is
-no match.
+<b>pcre_exec()</b> or <b>pcre_dfa_exec()</b> matches as far as it goes, but is
+too short to match the entire pattern, PCRE_ERROR_NOMATCH is returned. There
+are circumstances where it might be helpful to distinguish this case from other
+cases in which there is no match.
 </P>
 <P>
 Consider, for example, an application where a human is required to type in data
@@ -41,10 +44,20 @@
 </P>
 <P>
 PCRE supports the concept of partial matching by means of the PCRE_PARTIAL
-option, which can be set when calling <b>pcre_exec()</b>. When this is done, the
-return code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if at any
-time during the matching process the entire subject string matched part of the
-pattern. No captured data is set when this occurs.
+option, which can be set when calling <b>pcre_exec()</b> or
+<b>pcre_dfa_exec()</b>. When this flag is set for <b>pcre_exec()</b>, the return
+code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if at any time
+during the matching process the last part of the subject string matched part of
+the pattern. Unfortunately, for non-anchored matching, it is not possible to
+obtain the position of the start of the partial match. No captured data is set
+when PCRE_ERROR_PARTIAL is returned.
+</P>
+<P>
+When PCRE_PARTIAL is set for <b>pcre_dfa_exec()</b>, the return code
+PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end of the
+subject is reached, there have been no complete matches, but there is still at
+least one matching possibility. The portion of the string that provided the
+partial match is set as the first matching string.
 </P>
 <P>
 Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers the
@@ -54,9 +67,10 @@
 </P>
 <br><a name="SEC2" href="#TOC1">RESTRICTED PATTERNS FOR PCRE_PARTIAL</a><br>
 <P>
-Because of the way certain internal optimizations are implemented in PCRE, the
-PCRE_PARTIAL option cannot be used with all patterns. Repeated single
-characters such as
+Because of the way certain internal optimizations are implemented in the
+<b>pcre_exec()</b> function, the PCRE_PARTIAL option cannot be used with all
+patterns. These restrictions do not apply when <b>pcre_dfa_exec()</b> is used.
+For <b>pcre_exec()</b>, repeated single characters such as
 <pre>
   a{2,4}
 </pre>
@@ -78,6 +92,8 @@
 <P>
 If PCRE_PARTIAL is set for a pattern that does not conform to the restrictions,
 <b>pcre_exec()</b> returns the error code PCRE_ERROR_BADPARTIAL (-13).
+You can use the PCRE_INFO_OKPARTIAL call to <b>pcre_fullinfo()</b> to find out
+if a compiled pattern can be used for partial matching.
 </P>
 <br><a name="SEC3" href="#TOC1">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a><br>
 <P>
@@ -100,12 +116,127 @@
 </pre>
 The first data string is matched completely, so <b>pcretest</b> shows the
 matched substrings. The remaining four strings do not match the complete
-pattern, but the first two are partial matches.
+pattern, but the first two are partial matches. The same test, using
+<b>pcre_dfa_exec()</b> matching (by means of the \D escape sequence), produces
+the following output:
+<pre>
+    re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
+  data&#62; 25jun04\P\D
+   0: 25jun04
+  data&#62; 23dec3\P\D
+  Partial match: 23dec3
+  data&#62; 3ju\P\D
+  Partial match: 3ju
+  data&#62; 3juj\P\D
+  No match
+  data&#62; j\P\D
+  No match
+</pre>
+Notice that in this case the portion of the string that was matched is made
+available.
+</P>
+<br><a name="SEC4" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()</a><br>
+<P>
+When a partial match has been found using <b>pcre_dfa_exec()</b>, it is possible
+to continue the match by providing additional subject data and calling
+<b>pcre_dfa_exec()</b> again with the same compiled regular expression, this
+time setting the PCRE_DFA_RESTART option. You must also pass the same working
+space as before, because this is where details of the previous partial match
+are stored. Here is an example using <b>pcretest</b>, using the \R escape
+sequence to set the PCRE_DFA_RESTART option (\P and \D are as above):
+<pre>
+    re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
+  data&#62; 23ja\P\D
+  Partial match: 23ja
+  data&#62; n05\R\D
+   0: n05
+</pre>
+The first call has "23ja" as the subject, and requests partial matching; the
+second call has "n05" as the subject for the continued (restarted) match.
+Notice that when the match is complete, only the last part is shown; PCRE does
+not retain the previously partially-matched string. It is up to the calling
+program to do that if it needs to.
+</P>
+<P>
+You can set PCRE_PARTIAL with PCRE_DFA_RESTART to continue partial matching
+over multiple segments. This facility can be used to pass very long subject
+strings to <b>pcre_dfa_exec()</b>. However, some care is needed for certain
+types of pattern.
+</P>
+<P>
+1. If the pattern contains tests for the beginning or end of a line, you need
+to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the
+subject string for any call does not contain the beginning or end of a line.
+</P>
+<P>
+2. If the pattern contains backward assertions (including \b or \B), you need
+to arrange for some overlap in the subject strings to allow for this. For
+example, you could pass the subject in chunks that are 500 bytes long, but in
+a buffer of 700 bytes, with the starting offset set to 200 and the previous 200
+bytes at the start of the buffer.
+</P>
+<P>
+3. Matching a subject string that is split into multiple segments does not
+always produce exactly the same result as matching over one single long string.
+The difference arises when there are multiple matching possibilities, because a
+partial match result is given only when there are no completed matches in a
+call to <b>pcre_dfa_exec()</b>. This means that as soon as the shortest match has
+been found, continuation to a new subject segment is no longer possible.
+Consider this <b>pcretest</b> example:
+<pre>
+    re&#62; /dog(sbody)?/
+  data&#62; do\P\D
+  Partial match: do
+  data&#62; gsb\R\P\D
+   0: g
+  data&#62; dogsbody\D
+   0: dogsbody
+   1: dog
+</pre>
+The pattern matches the words "dog" or "dogsbody". When the subject is
+presented in several parts ("do" and "gsb" being the first two) the match stops
+when "dog" has been found, and it is not possible to continue. On the other
+hand, if "dogsbody" is presented as a single string, both matches are found.
 </P>
 <P>
-Last updated: 08 September 2004
+Because of this phenomenon, it does not usually make sense to end a pattern
+that is going to be matched in this way with a variable repeat.
+</P>
+<P>
+4. Patterns that contain alternatives at the top level which do not all
+start with the same pattern item may not work as expected. For example,
+consider this pattern:
+<pre>
+  1234|3789
+</pre>
+If the first part of the subject is "ABC123", a partial match of the first
+alternative is found at offset 3. There is no partial match for the second
+alternative, because such a match does not start at the same point in the
+subject string. Attempting to continue with the string "789" does not yield a
+match because only those alternatives that match at one point in the subject
+are remembered. The problem arises because the start of the second alternative
+matches within the first alternative. There is no problem with anchored
+patterns or patterns such as:
+<pre>
+  1234|ABCD
+</pre>
+where no string can be a partial match for both alternatives.
+</P>
+<br><a name="SEC5" href="#TOC1">AUTHOR</a><br>
+<P>
+Philip Hazel
+<br>
+University Computing Service
+<br>
+Cambridge CB2 3QH, England.
+<br>
+</P>
+<br><a name="SEC6" href="#TOC1">REVISION</a><br>
+<P>
+Last updated: 04 June 2007
+<br>
+Copyright &copy; 1997-2007 University of Cambridge.
 <br>
-Copyright &copy; 1997-2004 University of Cambridge.
 <p>
 Return to the <a href="index.html">PCRE index page</a>.
 </p>
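
Item 4 of the multi-segment notes in pcrepartial.html above (the 1234|3789 case) can be reproduced with a small state-set simulation. This is a simplified sketch for alternations of literal strings only, assuming that a call stops at the first completed alternative; run, first_segment and restart are hypothetical names and are not part of the PCRE API.

```python
def run(alternatives, text, active):
    """Advance every live (alternative_index, chars_matched) state through text.

    Returns ("match", alternative) when an alternative completes,
    ("partial", states) if text ends while states are still live,
    and ("nomatch", None) once every state has died.
    """
    for ch in text:
        advanced = set()
        for alt, n in active:
            if alternatives[alt][n] == ch:
                if n + 1 == len(alternatives[alt]):
                    return ("match", alternatives[alt])
                advanced.add((alt, n + 1))
        active = advanced
        if not active:
            return ("nomatch", None)
    return ("partial", active)


def first_segment(alternatives, text):
    """Unanchored first call: try each start offset in turn."""
    start_states = {(i, 0) for i in range(len(alternatives))}
    for offset in range(len(text)):
        result = run(alternatives, text[offset:], start_states)
        if result[0] != "nomatch":
            return result + (offset,)
    return ("nomatch", None, None)


def restart(alternatives, text, active):
    """Continuation call: only the remembered states are extended;
    no new starting points are considered."""
    return run(alternatives, text, active)


ALTS = ["1234", "3789"]
status, states, offset = first_segment(ALTS, "ABC123")   # partial match of "1234"
continued = restart(ALTS, "789", states)                  # the remembered state dies
whole = first_segment(ALTS, "ABC123789")                  # one string: "3789" is found
```

The partial result remembers only the state for "1234" at offset 3, so the continuation with "789" fails, whereas the unsplit subject matches "3789" at offset 5.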