Posted to cvs@httpd.apache.org by pg...@apache.org on 2007/11/26 17:50:09 UTC
svn commit: r598339 [20/37] - in /httpd/httpd/vendor/pcre/current: ./ doc/
doc/html/ testdata/
Modified: httpd/httpd/vendor/pcre/current/doc/pcresample.3
URL: http://svn.apache.org/viewvc/httpd/httpd/vendor/pcre/current/doc/pcresample.3?rev=598339&r1=598338&r2=598339&view=diff
==============================================================================
--- httpd/httpd/vendor/pcre/current/doc/pcresample.3 (original)
+++ httpd/httpd/vendor/pcre/current/doc/pcresample.3 Mon Nov 26 08:49:53 2007
@@ -1,4 +1,4 @@
-.TH PCRE 3
+.TH PCRESAMPLE 3
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE SAMPLE PROGRAM"
@@ -18,9 +18,10 @@
string. The logic is a little bit tricky because of the possibility of matching
an empty string. Comments in the code explain what is going on.
.P
-If PCRE is installed in the standard include and library directories for your
-system, you should be able to compile the demonstration program using this
-command:
+The demonstration program is automatically built if you use "./configure;make"
+to build PCRE. Otherwise, if PCRE is installed in the standard include and
+library directories for your system, you should be able to compile the
+demonstration program using this command:
.sp
gcc -o pcredemo pcredemo.c -lpcre
.sp
@@ -59,8 +60,22 @@
-R/usr/local/lib
.sp
(for example) to the compile command to get round this problem.
-.P
-.in 0
-Last updated: 09 September 2004
-.br
-Copyright (c) 1997-2004 University of Cambridge.
+.
+.
+.SH AUTHOR
+.rs
+.sp
+.nf
+Philip Hazel
+University Computing Service
+Cambridge CB2 3QH, England.
+.fi
+.
+.
+.SH REVISION
+.rs
+.sp
+.nf
+Last updated: 13 June 2007
+Copyright (c) 1997-2007 University of Cambridge.
+.fi
Added: httpd/httpd/vendor/pcre/current/doc/pcrestack.3
URL: http://svn.apache.org/viewvc/httpd/httpd/vendor/pcre/current/doc/pcrestack.3?rev=598339&view=auto
==============================================================================
--- httpd/httpd/vendor/pcre/current/doc/pcrestack.3 (added)
+++ httpd/httpd/vendor/pcre/current/doc/pcrestack.3 Mon Nov 26 08:49:53 2007
@@ -0,0 +1,140 @@
+.TH PCRESTACK 3
+.SH NAME
+PCRE - Perl-compatible regular expressions
+.SH "PCRE DISCUSSION OF STACK USAGE"
+.rs
+.sp
+When you call \fBpcre_exec()\fP, it makes use of an internal function called
+\fBmatch()\fP. This calls itself recursively at branch points in the pattern,
+in order to remember the state of the match so that it can back up and try a
+different alternative if the first one fails. As matching proceeds deeper and
+deeper into the tree of possibilities, the recursion depth increases.
+.P
+Not all calls of \fBmatch()\fP increase the recursion depth; for an item such
+as a* it may be called several times at the same level, after matching
+different numbers of a's. Furthermore, in a number of cases where the result of
+the recursive call would immediately be passed back as the result of the
+current call (a "tail recursion"), the function is just restarted instead.
+.P
+The \fBpcre_dfa_exec()\fP function operates in an entirely different way, and
+hardly uses recursion at all. The limit on its complexity is the amount of
+workspace it is given. The comments that follow do NOT apply to
+\fBpcre_dfa_exec()\fP; they are relevant only for \fBpcre_exec()\fP.
+.P
+You can set limits on the number of times that \fBmatch()\fP is called, both in
+total and recursively. If the limit is exceeded, an error occurs. For details,
+see the
+.\" HTML <a href="pcreapi.html#extradata">
+.\" </a>
+section on extra data for \fBpcre_exec()\fP
+.\"
+in the
+.\" HREF
+\fBpcreapi\fP
+.\"
+documentation.
+.P
+Each time that \fBmatch()\fP is actually called recursively, it uses memory
+from the process stack. For certain kinds of pattern and data, very large
+amounts of stack may be needed, despite the recognition of "tail recursion".
+You can often reduce the amount of recursion, and therefore the amount of stack
+used, by modifying the pattern that is being matched. Consider, for example,
+this pattern:
+.sp
+ ([^<]|<(?!inet))+
+.sp
+It matches from wherever it starts until it encounters "<inet" or the end of
+the data, and is the kind of pattern that might be used when processing an XML
+file. Each iteration of the outer parentheses matches either one character that
+is not "<" or a "<" that is not followed by "inet". However, each time a
+parenthesis is processed, a recursion occurs, so this formulation uses a stack
+frame for each matched character. For a long string, a lot of stack is
+required. Consider now this rewritten pattern, which matches exactly the same
+strings:
+.sp
+ ([^<]++|<(?!inet))+
+.sp
+This uses very much less stack, because runs of characters that do not contain
+"<" are "swallowed" in one item inside the parentheses. Recursion happens only
+when a "<" character that is not followed by "inet" is encountered (and we
+assume this is relatively rare). A possessive quantifier is used to stop any
+backtracking into the runs of non-"<" characters, but that is not related to
+stack usage.
+.P
+This example shows that one way of avoiding stack problems when matching long
+subject strings is to write repeated parenthesized subpatterns to match more
+than one character whenever possible.
+.P
+In environments where stack memory is constrained, you might want to compile
+PCRE to use heap memory instead of stack for remembering back-up points. This
+makes it run a lot more slowly, however. Details of how to do this are given in
+the
+.\" HREF
+\fBpcrebuild\fP
+.\"
+documentation. When built in this way, instead of using the stack, PCRE obtains
+and frees memory by calling the functions that are pointed to by the
+\fBpcre_stack_malloc\fP and \fBpcre_stack_free\fP variables. By default, these
+point to \fBmalloc()\fP and \fBfree()\fP, but you can replace the pointers to
+cause PCRE to use your own functions. Since the block sizes are always the
+same, and are always freed in reverse order, it may be possible to implement
+customized memory handlers that are more efficient than the standard functions.
+.P
+In Unix-like environments, there is not often a problem with the stack unless
+very long strings are involved, though the default limit on stack size varies
+from system to system. Values from 8Mb to 64Mb are common. You can find your
+default limit by running the command:
+.sp
+ ulimit -s
+.sp
+Unfortunately, the effect of running out of stack is often SIGSEGV, though
+sometimes a more explicit error message is given. You can normally increase the
+limit on stack size by code such as this:
+.sp
+ struct rlimit rlim;
+ getrlimit(RLIMIT_STACK, &rlim);
+ rlim.rlim_cur = 100*1024*1024;
+ setrlimit(RLIMIT_STACK, &rlim);
+.sp
+This reads the current limits (soft and hard) using \fBgetrlimit()\fP, then
+attempts to increase the soft limit to 100Mb using \fBsetrlimit()\fP. You must
+do this before calling \fBpcre_exec()\fP.
+.P
+PCRE has an internal counter that can be used to limit the depth of recursion,
+and thus cause \fBpcre_exec()\fP to give an error code before it runs out of
+stack. By default, the limit is very large, and unlikely ever to operate. It
+can be changed when PCRE is built, and it can also be set when
+\fBpcre_exec()\fP is called. For details of these interfaces, see the
+.\" HREF
+\fBpcrebuild\fP
+.\"
+and
+.\" HREF
+\fBpcreapi\fP
+.\"
+documentation.
+.P
+As a very rough rule of thumb, you should reckon on about 500 bytes per
+recursion. Thus, if you want to limit your stack usage to 8Mb, you
+should set the limit at 16000 recursions. A 64Mb stack, on the other hand, can
+support around 128000 recursions. The \fBpcretest\fP test program has a command
+line option (\fB-S\fP) that can be used to increase the size of its stack.
+.
+.
+.SH AUTHOR
+.rs
+.sp
+.nf
+Philip Hazel
+University Computing Service
+Cambridge CB2 3QH, England.
+.fi
+.
+.
+.SH REVISION
+.rs
+.sp
+.nf
+Last updated: 05 June 2007
+Copyright (c) 1997-2007 University of Cambridge.
+.fi
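Note on the pcrestack.3 text above: the setrlimit() fragment and the
500-bytes-per-recursion rule of thumb can be combined into a small helper.
The following is only a sketch under stated assumptions, not code from PCRE:
the names BYTES_PER_RECURSION, recursion_limit_for_stack() and
raise_stack_limit() are hypothetical, and the 500-byte figure is the rough
estimate quoted in the page, not a value exported by the library. Unlike the
fragment in the page, it checks the return values and does not ask for more
than the hard limit.

```c
#include <stddef.h>
#include <sys/resource.h>

/* Rough stack cost of one match() recursion, taken from the rule of thumb
   in pcrestack.3. This is an estimate, not a constant defined by PCRE. */
#define BYTES_PER_RECURSION 500UL

/* Estimate how many recursions fit into stack_bytes of stack. */
static unsigned long recursion_limit_for_stack(unsigned long stack_bytes)
{
    return stack_bytes / BYTES_PER_RECURSION;
}

/* Try to raise the soft stack limit to at least `wanted` bytes.
   The request is clamped to the hard limit. Returns 0 on success
   (including when the current limit is already large enough) and
   -1 if getrlimit() or setrlimit() fails. */
static int raise_stack_limit(rlim_t wanted)
{
    struct rlimit rlim;

    if (getrlimit(RLIMIT_STACK, &rlim) != 0)
        return -1;

    /* Nothing to do if the soft limit is unlimited or already sufficient. */
    if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted)
        return 0;

    /* The soft limit may not exceed the hard limit. */
    if (rlim.rlim_max != RLIM_INFINITY && wanted > rlim.rlim_max)
        wanted = rlim.rlim_max;

    rlim.rlim_cur = wanted;
    return setrlimit(RLIMIT_STACK, &rlim);
}
```

A caller would typically run raise_stack_limit() once at startup, before any
call to pcre_exec(), and could pass the result of recursion_limit_for_stack()
as the recursion limit described in the pcreapi documentation. With integer
division, an 8Mb stack gives 16777 recursions, which the page rounds to 16000.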
Added: httpd/httpd/vendor/pcre/current/doc/pcresyntax.3
URL: http://svn.apache.org/viewvc/httpd/httpd/vendor/pcre/current/doc/pcresyntax.3?rev=598339&view=auto
==============================================================================
--- httpd/httpd/vendor/pcre/current/doc/pcresyntax.3 (added)
+++ httpd/httpd/vendor/pcre/current/doc/pcresyntax.3 Mon Nov 26 08:49:53 2007
@@ -0,0 +1,423 @@
+.TH PCRESYNTAX 3
+.SH NAME
+PCRE - Perl-compatible regular expressions
+.SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
+.rs
+.sp
+The full syntax and semantics of the regular expressions that are supported by
+PCRE are described in the
+.\" HREF
+\fBpcrepattern\fP
+.\"
+documentation. This document contains just a quick-reference summary of the
+syntax.
+.
+.
+.SH "QUOTING"
+.rs
+.sp
+ \ex where x is non-alphanumeric is a literal x
+ \eQ...\eE treat enclosed characters as literal
+.
+.
+.SH "CHARACTERS"
+.rs
+.sp
+ \ea alarm, that is, the BEL character (hex 07)
+ \ecx "control-x", where x is any character
+ \ee escape (hex 1B)
+ \ef formfeed (hex 0C)
+ \en newline (hex 0A)
+ \er carriage return (hex 0D)
+ \et tab (hex 09)
+ \eddd character with octal code ddd, or backreference
+ \exhh character with hex code hh
+ \ex{hhh..} character with hex code hhh..
+.
+.
+.SH "CHARACTER TYPES"
+.rs
+.sp
+ . any character except newline;
+ in dotall mode, any character whatsoever
+ \eC one byte, even in UTF-8 mode (best avoided)
+ \ed a decimal digit
+ \eD a character that is not a decimal digit
+ \eh a horizontal whitespace character
+ \eH a character that is not a horizontal whitespace character
+ \ep{\fIxx\fP} a character with the \fIxx\fP property
+ \eP{\fIxx\fP} a character without the \fIxx\fP property
+ \eR a newline sequence
+ \es a whitespace character
+ \eS a character that is not a whitespace character
+ \ev a vertical whitespace character
+ \eV a character that is not a vertical whitespace character
+ \ew a "word" character
+ \eW a "non-word" character
+ \eX an extended Unicode sequence
+.sp
+In PCRE, \ed, \eD, \es, \eS, \ew, and \eW recognize only ASCII characters.
+.
+.
+.SH "GENERAL CATEGORY PROPERTY CODES FOR \ep and \eP"
+.rs
+.sp
+ C Other
+ Cc Control
+ Cf Format
+ Cn Unassigned
+ Co Private use
+ Cs Surrogate
+.sp
+ L Letter
+ Ll Lower case letter
+ Lm Modifier letter
+ Lo Other letter
+ Lt Title case letter
+ Lu Upper case letter
+ L& Ll, Lu, or Lt
+.sp
+ M Mark
+ Mc Spacing mark
+ Me Enclosing mark
+ Mn Non-spacing mark
+.sp
+ N Number
+ Nd Decimal number
+ Nl Letter number
+ No Other number
+.sp
+ P Punctuation
+ Pc Connector punctuation
+ Pd Dash punctuation
+ Pe Close punctuation
+ Pf Final punctuation
+ Pi Initial punctuation
+ Po Other punctuation
+ Ps Open punctuation
+.sp
+ S Symbol
+ Sc Currency symbol
+ Sk Modifier symbol
+ Sm Mathematical symbol
+ So Other symbol
+.sp
+ Z Separator
+ Zl Line separator
+ Zp Paragraph separator
+ Zs Space separator
+.
+.
+.SH "SCRIPT NAMES FOR \ep AND \eP"
+.rs
+.sp
+Arabic,
+Armenian,
+Balinese,
+Bengali,
+Bopomofo,
+Braille,
+Buginese,
+Buhid,
+Canadian_Aboriginal,
+Cherokee,
+Common,
+Coptic,
+Cuneiform,
+Cypriot,
+Cyrillic,
+Deseret,
+Devanagari,
+Ethiopic,
+Georgian,
+Glagolitic,
+Gothic,
+Greek,
+Gujarati,
+Gurmukhi,
+Han,
+Hangul,
+Hanunoo,
+Hebrew,
+Hiragana,
+Inherited,
+Kannada,
+Katakana,
+Kharoshthi,
+Khmer,
+Lao,
+Latin,
+Limbu,
+Linear_B,
+Malayalam,
+Mongolian,
+Myanmar,
+New_Tai_Lue,
+Nko,
+Ogham,
+Old_Italic,
+Old_Persian,
+Oriya,
+Osmanya,
+Phags_Pa,
+Phoenician,
+Runic,
+Shavian,
+Sinhala,
+Syloti_Nagri,
+Syriac,
+Tagalog,
+Tagbanwa,
+Tai_Le,
+Tamil,
+Telugu,
+Thaana,
+Thai,
+Tibetan,
+Tifinagh,
+Ugaritic,
+Yi.
+.
+.
+.SH "CHARACTER CLASSES"
+.rs
+.sp
+ [...] positive character class
+ [^...] negative character class
+ [x-y] range (can be used for hex characters)
+ [[:xxx:]] positive POSIX named set
+ [[:^xxx:]] negative POSIX named set
+.sp
+ alnum alphanumeric
+ alpha alphabetic
+ ascii 0-127
+ blank space or tab
+ cntrl control character
+ digit decimal digit
+ graph printing, excluding space
+ lower lower case letter
+ print printing, including space
+ punct printing, excluding alphanumeric
+ space whitespace
+ upper upper case letter
+ word same as \ew
+ xdigit hexadecimal digit
+.sp
+In PCRE, POSIX character set names recognize only ASCII characters. You can use
+\eQ...\eE inside a character class.
+.
+.
+.SH "QUANTIFIERS"
+.rs
+.sp
+ ? 0 or 1, greedy
+ ?+ 0 or 1, possessive
+ ?? 0 or 1, lazy
+ * 0 or more, greedy
+ *+ 0 or more, possessive
+ *? 0 or more, lazy
+ + 1 or more, greedy
+ ++ 1 or more, possessive
+ +? 1 or more, lazy
+ {n} exactly n
+ {n,m} at least n, no more than m, greedy
+ {n,m}+ at least n, no more than m, possessive
+ {n,m}? at least n, no more than m, lazy
+ {n,} n or more, greedy
+ {n,}+ n or more, possessive
+ {n,}? n or more, lazy
+.
+.
+.SH "ANCHORS AND SIMPLE ASSERTIONS"
+.rs
+.sp
+ \eb word boundary
+ \eB not a word boundary
+ ^ start of subject
+ also after internal newline in multiline mode
+ \eA start of subject
+ $ end of subject
+ also before newline at end of subject
+ also before internal newline in multiline mode
+ \eZ end of subject
+ also before newline at end of subject
+ \ez end of subject
+ \eG first matching position in subject
+.
+.
+.SH "MATCH POINT RESET"
+.rs
+.sp
+ \eK reset start of match
+.
+.
+.SH "ALTERNATION"
+.rs
+.sp
+ expr|expr|expr...
+.
+.
+.SH "CAPTURING"
+.rs
+.sp
+ (...) capturing group
+ (?<name>...) named capturing group (Perl)
+ (?'name'...) named capturing group (Perl)
+ (?P<name>...) named capturing group (Python)
+ (?:...) non-capturing group
+ (?|...) non-capturing group; reset group numbers for
+ capturing groups in each alternative
+.
+.
+.SH "ATOMIC GROUPS"
+.rs
+.sp
+ (?>...) atomic, non-capturing group
+.
+.
+.
+.
+.SH "COMMENT"
+.rs
+.sp
+ (?#....) comment (not nestable)
+.
+.
+.SH "OPTION SETTING"
+.rs
+.sp
+ (?i) caseless
+ (?J) allow duplicate names
+ (?m) multiline
+ (?s) single line (dotall)
+ (?U) default ungreedy (lazy)
+ (?x) extended (ignore white space)
+ (?-...) unset option(s)
+.
+.
+.SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
+.rs
+.sp
+ (?=...) positive look ahead
+ (?!...) negative look ahead
+ (?<=...) positive look behind
+ (?<!...) negative look behind
+.sp
+Each top-level branch of a look behind must be of a fixed length.
+.SH "BACKREFERENCES"
+.rs
+.sp
+ \en reference by number (can be ambiguous)
+ \egn reference by number
+ \eg{n} reference by number
+ \eg{-n} relative reference by number
+ \ek<name> reference by name (Perl)
+ \ek'name' reference by name (Perl)
+ \eg{name} reference by name (Perl)
+ \ek{name} reference by name (.NET)
+ (?P=name) reference by name (Python)
+.
+.
+.SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
+.rs
+.sp
+ (?R) recurse whole pattern
+ (?n) call subpattern by absolute number
+ (?+n) call subpattern by relative number
+ (?-n) call subpattern by relative number
+ (?&name) call subpattern by name (Perl)
+ (?P>name) call subpattern by name (Python)
+.
+.
+.SH "CONDITIONAL PATTERNS"
+.rs
+.sp
+ (?(condition)yes-pattern)
+ (?(condition)yes-pattern|no-pattern)
+.sp
+ (?(n)... absolute reference condition
+ (?(+n)... relative reference condition
+ (?(-n)... relative reference condition
+ (?(<name>)... named reference condition (Perl)
+ (?('name')... named reference condition (Perl)
+ (?(name)... named reference condition (PCRE)
+ (?(R)... overall recursion condition
+ (?(Rn)... specific group recursion condition
+ (?(R&name)... specific recursion condition
+ (?(DEFINE)... define subpattern for reference
+ (?(assert)... assertion condition
+.
+.
+.SH "BACKTRACKING CONTROL"
+.rs
+.sp
+The following act as soon as they are reached:
+.sp
+ (*ACCEPT) force successful match
+ (*FAIL) force backtrack; synonym (*F)
+.sp
+The following act only when a subsequent match failure causes a backtrack to
+reach them. They all force a match failure, but they differ in what happens
+afterwards. Those that advance the start-of-match point do so only if the
+pattern is not anchored.
+.sp
+ (*COMMIT) overall failure, no advance of starting point
+ (*PRUNE) advance to next starting character
+ (*SKIP) advance start to current matching position
+ (*THEN) local failure, backtrack to next alternation
+.
+.
+.SH "NEWLINE CONVENTIONS"
+.rs
+.sp
+These are recognized only at the very start of the pattern or after a
+(*BSR_...) option.
+.sp
+ (*CR)
+ (*LF)
+ (*CRLF)
+ (*ANYCRLF)
+ (*ANY)
+.
+.
+.SH "WHAT \eR MATCHES"
+.rs
+.sp
+These are recognized only at the very start of the pattern or after a
+(*...) option that sets the newline convention.
+.sp
+ (*BSR_ANYCRLF)
+ (*BSR_UNICODE)
+.
+.
+.SH "CALLOUTS"
+.rs
+.sp
+ (?C) callout
+ (?Cn) callout with data n
+.
+.
+.SH "SEE ALSO"
+.rs
+.sp
+\fBpcrepattern\fP(3), \fBpcreapi\fP(3), \fBpcrecallout\fP(3),
+\fBpcrematching\fP(3), \fBpcre\fP(3).
+.
+.
+.SH AUTHOR
+.rs
+.sp
+.nf
+Philip Hazel
+University Computing Service
+Cambridge CB2 3QH, England.
+.fi
+.
+.
+.SH REVISION
+.rs
+.sp
+.nf
+Last updated: 21 September 2007
+Copyright (c) 1997-2007 University of Cambridge.
+.fi
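Note on the pcresyntax.3 summary above: the POSIX named sets listed under
"CHARACTER CLASSES" use the same names as POSIX regular expressions, so they
can be tried out without PCRE. The sketch below is an illustration under that
assumption only: it calls the system's <regex.h> functions (regcomp(),
regexec(), regfree()) rather than PCRE, and posix_match() is a hypothetical
helper name. The system functions may treat non-ASCII characters differently
in some locales, whereas the page states that PCRE's POSIX sets recognize only
ASCII characters.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if subject matches pattern, 0 if it does not, and -1 if the
   pattern cannot be compiled. The pattern is a POSIX extended regular
   expression, not a PCRE pattern. */
static int posix_match(const char *pattern, const char *subject)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: only a yes/no answer is needed, no captured substrings. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

For example, posix_match("^[[:alpha:]]+[[:digit:]]{2,}$", "abc123") uses the
alpha and digit sets together with the {n,} quantifier from the summary.
Constructs that are specific to Perl or PCRE, such as possessive quantifiers
or \eK, are not accepted by the POSIX functions.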
Modified: httpd/httpd/vendor/pcre/current/doc/pcretest.1
URL: http://svn.apache.org/viewvc/httpd/httpd/vendor/pcre/current/doc/pcretest.1?rev=598339&r1=598338&r2=598339&view=diff
==============================================================================
--- httpd/httpd/vendor/pcre/current/doc/pcretest.1 (original)
+++ httpd/httpd/vendor/pcre/current/doc/pcretest.1 Mon Nov 26 08:49:53 2007
@@ -4,10 +4,8 @@
.SH SYNOPSIS
.rs
.sp
-.B pcretest "[-C] [-d] [-i] [-m] [-o osize] [-p] [-t] [source]"
-.ti +5n
-.B "[destination]"
-.P
+.B pcretest "[options] [source] [destination]"
+.sp
\fBpcretest\fP was written as a test program for the PCRE regular expression
library itself, but it can also be used for experimenting with regular
expressions. This document describes the features of the test program; for
@@ -26,16 +24,29 @@
.SH OPTIONS
.rs
.TP 10
+\fB-b\fP
+Behave as if each regex has the \fB/B\fP (show bytecode) modifier; the internal
+form is output after compilation.
+.TP 10
\fB-C\fP
Output the version number of the PCRE library, and all available information
about the optional features that are included, and then exit.
.TP 10
\fB-d\fP
-Behave as if each regex had the \fB/D\fP (debug) modifier; the internal
-form is output after compilation.
+Behave as if each regex has the \fB/D\fP (debug) modifier; the internal
+form and information about the compiled pattern are output after compilation;
+\fB-d\fP is equivalent to \fB-b -i\fP.
+.TP 10
+\fB-dfa\fP
+Behave as if each data line contains the \eD escape sequence; this causes the
+alternative matching function, \fBpcre_dfa_exec()\fP, to be used instead of the
+standard \fBpcre_exec()\fP function (more detail is given below).
+.TP 10
+\fB-help\fP
+Output a brief summary of these options and then exit.
.TP 10
\fB-i\fP
-Behave as if each regex had the \fB/I\fP modifier; information about the
+Behave as if each regex has the \fB/I\fP modifier; information about the
compiled pattern is given after compilation.
.TP 10
\fB-m\fP
@@ -45,19 +56,36 @@
.TP 10
\fB-o\fP \fIosize\fP
Set the number of elements in the output vector that is used when calling
-\fBpcre_exec()\fP to be \fIosize\fP. The default value is 45, which is enough
-for 14 capturing subexpressions. The vector size can be changed for individual
-matching calls by including \eO in the data line (see below).
+\fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP to be \fIosize\fP. The default value
+is 45, which is enough for 14 capturing subexpressions for \fBpcre_exec()\fP or
+22 different matches for \fBpcre_dfa_exec()\fP. The vector size can be
+changed for individual matching calls by including \eO in the data line (see
+below).
.TP 10
\fB-p\fP
-Behave as if each regex has \fB/P\fP modifier; the POSIX wrapper API is used
-to call PCRE. None of the other options has any effect when \fB-p\fP is set.
+Behave as if each regex has the \fB/P\fP modifier; the POSIX wrapper API is
+used to call PCRE. None of the other options has any effect when \fB-p\fP is
+set.
+.TP 10
+\fB-q\fP
+Do not output the version number of \fBpcretest\fP at the start of execution.
+.TP 10
+\fB-S\fP \fIsize\fP
+On Unix-like systems, set the size of the runtime stack to \fIsize\fP
+megabytes.
.TP 10
\fB-t\fP
Run each compile, study, and match many times with a timer, and output
resulting time per compile or match (in milliseconds). Do not set \fB-m\fP with
\fB-t\fP, because you will then get the size output a zillion times, and the
-timing will be distorted.
+timing will be distorted. You can control the number of iterations that are
+used for timing by following \fB-t\fP with a number (as a separate item on the
+command line). For example, "-t 1000" would iterate 1000 times. The default is
+to iterate 500000 times.
+.TP 10
+\fB-tm\fP
+This is like \fB-t\fP except that it times only the matching phase, not the
+compile or study phases.
.
.
.SH DESCRIPTION
@@ -74,13 +102,14 @@
lines to be matched against the pattern.
.P
Each data line is matched separately and independently. If you want to do
-multiple-line matches, you have to use the \en escape sequence in a single line
-of input to encode the newline characters. The maximum length of data line is
-30,000 characters.
+multi-line matches, you have to use the \en escape sequence (or \er or \er\en,
+etc., depending on the newline setting) in a single line of input to encode the
+newline sequences. There is no limit on the length of data lines; the input
+buffer is automatically extended if it is too small.
.P
An empty line signals the end of the data lines, at which point a new regular
expression is read. The regular expressions are given enclosed in any
-non-alphanumeric delimiters other than backslash, for example
+non-alphanumeric delimiters other than backslash, for example:
.sp
/(a|bc)x+yz/
.sp
@@ -128,12 +157,37 @@
The following table shows additional modifiers for setting PCRE options that do
not correspond to anything in Perl:
.sp
- \fB/A\fP PCRE_ANCHORED
- \fB/C\fP PCRE_AUTO_CALLOUT
- \fB/E\fP PCRE_DOLLAR_ENDONLY
- \fB/N\fP PCRE_NO_AUTO_CAPTURE
- \fB/U\fP PCRE_UNGREEDY
- \fB/X\fP PCRE_EXTRA
+ \fB/A\fP PCRE_ANCHORED
+ \fB/C\fP PCRE_AUTO_CALLOUT
+ \fB/E\fP PCRE_DOLLAR_ENDONLY
+ \fB/f\fP PCRE_FIRSTLINE
+ \fB/J\fP PCRE_DUPNAMES
+ \fB/N\fP PCRE_NO_AUTO_CAPTURE
+ \fB/U\fP PCRE_UNGREEDY
+ \fB/X\fP PCRE_EXTRA
+ \fB/<cr>\fP PCRE_NEWLINE_CR
+ \fB/<lf>\fP PCRE_NEWLINE_LF
+ \fB/<crlf>\fP PCRE_NEWLINE_CRLF
+ \fB/<anycrlf>\fP PCRE_NEWLINE_ANYCRLF
+ \fB/<any>\fP PCRE_NEWLINE_ANY
+ \fB/<bsr_anycrlf>\fP PCRE_BSR_ANYCRLF
+ \fB/<bsr_unicode>\fP PCRE_BSR_UNICODE
+.sp
+Those specifying line ending sequences are literal strings as shown, but the
+letters can be in either case. This example sets multiline matching with CRLF
+as the line ending sequence:
+.sp
+ /^abc/m<crlf>
+.sp
+Details of the meanings of these PCRE options are given in the
+.\" HREF
+\fBpcreapi\fP
+.\"
+documentation.
+.
+.
+.SS "Finding all matches in a string"
+.rs
.sp
Searching for all possible matches within each subject string can be requested
by the \fB/g\fP or \fB/G\fP modifier. After finding a match, PCRE is called
@@ -150,7 +204,11 @@
If this second match fails, the start offset is advanced by one, and the normal
match is retried. This imitates the way Perl handles such cases when using the
\fB/g\fP modifier or the \fBsplit()\fP function.
-.P
+.
+.
+.SS "Other modifiers"
+.rs
+.sp
There are yet more modifiers for controlling the way \fBpcretest\fP
operates.
.P
@@ -159,6 +217,13 @@
the subject string. This is useful for tests where the subject contains
multiple copies of the same substring.
.P
+The \fB/B\fP modifier is a debugging feature. It requests that \fBpcretest\fP
+output a representation of the compiled byte code after compilation. Normally
+this information contains length and offset values; however, if \fB/Z\fP is
+also present, this data is replaced by spaces. This is a special feature for
+use in the automatic test scripts; it ensures that the same output is generated
+for different internal link sizes.
+.P
The \fB/L\fP modifier must be followed directly by the name of a locale, for
example,
.sp
@@ -175,10 +240,8 @@
so on). It does this by calling \fBpcre_fullinfo()\fP after compiling a
pattern. If the pattern is studied, the results of that are also output.
.P
-The \fB/D\fP modifier is a PCRE debugging feature, which also assumes \fB/I\fP.
-It causes the internal form of compiled regular expressions to be output after
-compilation. If the pattern was studied, the information returned is also
-output.
+The \fB/D\fP modifier is a PCRE debugging feature, and is equivalent to
+\fB/BI\fP, that is, both the \fB/B\fP and the \fB/I\fP modifiers.
.P
The \fB/F\fP modifier causes \fBpcretest\fP to flip the byte order of the
fields in the compiled pattern that contain 2-byte and 4-byte numbers. This
@@ -222,21 +285,28 @@
expressions, you probably don't need any of these. The following escapes are
recognized:
.sp
- \ea alarm (= BEL)
- \eb backspace
- \ee escape
- \ef formfeed
- \en newline
- \er carriage return
- \et tab
- \ev vertical tab
+ \ea alarm (BEL, \ex07)
+ \eb backspace (\ex08)
+ \ee escape (\ex1b)
+ \ef formfeed (\ex0c)
+ \en newline (\ex0a)
+.\" JOIN
+ \eqdd set the PCRE_MATCH_LIMIT limit to dd
+ (any number of digits)
+ \er carriage return (\ex0d)
+ \et tab (\ex09)
+ \ev vertical tab (\ex0b)
\ennn octal character (up to 3 octal digits)
\exhh hexadecimal character (up to 2 hex digits)
.\" JOIN
\ex{hh...} hexadecimal character, any number of digits
in UTF-8 mode
+.\" JOIN
\eA pass the PCRE_ANCHORED option to \fBpcre_exec()\fP
+ or \fBpcre_dfa_exec()\fP
+.\" JOIN
\eB pass the PCRE_NOTBOL option to \fBpcre_exec()\fP
+ or \fBpcre_dfa_exec()\fP
.\" JOIN
\eCdd call pcre_copy_substring() for substring dd
after a successful match (number less than 32)
@@ -257,6 +327,8 @@
.\" JOIN
\eC*n pass the number n (may be negative) as callout
data; this is used as the callout return value
+ \eD use the \fBpcre_dfa_exec()\fP match function
+ \eF only shortest match for \fBpcre_dfa_exec()\fP
.\" JOIN
\eGdd call pcre_get_substring() for substring dd
after a successful match (number less than 32)
@@ -267,59 +339,122 @@
.\" JOIN
\eL call pcre_get_substringlist() after a
successful match
- \eM discover the minimum MATCH_LIMIT setting
+.\" JOIN
+ \eM discover the minimum MATCH_LIMIT and
+ MATCH_LIMIT_RECURSION settings
+.\" JOIN
\eN pass the PCRE_NOTEMPTY option to \fBpcre_exec()\fP
+ or \fBpcre_dfa_exec()\fP
.\" JOIN
\eOdd set the size of the output vector passed to
\fBpcre_exec()\fP to dd (any number of digits)
+.\" JOIN
\eP pass the PCRE_PARTIAL option to \fBpcre_exec()\fP
+ or \fBpcre_dfa_exec()\fP
+.\" JOIN
+ \eQdd set the PCRE_MATCH_LIMIT_RECURSION limit to dd
+ (any number of digits)
+ \eR pass the PCRE_DFA_RESTART option to \fBpcre_dfa_exec()\fP
\eS output details of memory get/free calls during matching
+.\" JOIN
\eZ pass the PCRE_NOTEOL option to \fBpcre_exec()\fP
+ or \fBpcre_dfa_exec()\fP
.\" JOIN
\e? pass the PCRE_NO_UTF8_CHECK option to
- \fBpcre_exec()\fP
+ \fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP
\e>dd start the match at offset dd (any number of digits);
+.\" JOIN
this sets the \fIstartoffset\fP argument for \fBpcre_exec()\fP
+ or \fBpcre_dfa_exec()\fP
+.\" JOIN
+ \e<cr> pass the PCRE_NEWLINE_CR option to \fBpcre_exec()\fP
+ or \fBpcre_dfa_exec()\fP
+.\" JOIN
+ \e<lf> pass the PCRE_NEWLINE_LF option to \fBpcre_exec()\fP
+ or \fBpcre_dfa_exec()\fP
+.\" JOIN
+ \e<crlf> pass the PCRE_NEWLINE_CRLF option to \fBpcre_exec()\fP
+ or \fBpcre_dfa_exec()\fP
+.\" JOIN
+ \e<anycrlf> pass the PCRE_NEWLINE_ANYCRLF option to \fBpcre_exec()\fP
+ or \fBpcre_dfa_exec()\fP
+.\" JOIN
+ \e<any> pass the PCRE_NEWLINE_ANY option to \fBpcre_exec()\fP
+ or \fBpcre_dfa_exec()\fP
.sp
-A backslash followed by anything else just escapes the anything else. If the
-very last character is a backslash, it is ignored. This gives a way of passing
-an empty line as data, since a real empty line terminates the data input.
+The escapes that specify line ending sequences are literal strings, exactly as
+shown. No more than one newline setting should be present in any data line.
+.P
+A backslash followed by anything else just escapes the anything else. If
+the very last character is a backslash, it is ignored. This gives a way of
+passing an empty line as data, since a real empty line terminates the data
+input.
.P
If \eM is present, \fBpcretest\fP calls \fBpcre_exec()\fP several times, with
-different values in the \fImatch_limit\fP field of the \fBpcre_extra\fP data
-structure, until it finds the minimum number that is needed for
-\fBpcre_exec()\fP to complete. This number is a measure of the amount of
-recursion and backtracking that takes place, and checking it out can be
-instructive. For most simple matches, the number is quite small, but for
-patterns with very large numbers of matching possibilities, it can become large
-very quickly with increasing length of subject string.
+different values in the \fImatch_limit\fP and \fImatch_limit_recursion\fP
+fields of the \fBpcre_extra\fP data structure, until it finds the minimum
+numbers for each parameter that allow \fBpcre_exec()\fP to complete. The
+\fImatch_limit\fP number is a measure of the amount of backtracking that takes
+place, and checking it out can be instructive. For most simple matches, the
+number is quite small, but for patterns with very large numbers of matching
+possibilities, it can become large very quickly with increasing length of
+subject string. The \fImatch_limit_recursion\fP number is a measure of how much
+stack (or, if PCRE is compiled with NO_RECURSE, how much heap) memory is needed
+to complete the match attempt.
.P
When \eO is used, the value specified may be higher or lower than the size set
by the \fB-o\fP command line option (or defaulted to 45); \eO applies only to
the call of \fBpcre_exec()\fP for the line in which it appears.
.P
If the \fB/P\fP modifier was present on the pattern, causing the POSIX wrapper
-API to be used, only \eB and \eZ have any effect, causing REG_NOTBOL and
-REG_NOTEOL to be passed to \fBregexec()\fP respectively.
+API to be used, the only option-setting sequences that have any effect are \eB
+and \eZ, causing REG_NOTBOL and REG_NOTEOL, respectively, to be passed to
+\fBregexec()\fP.
.P
The use of \ex{hh...} to represent UTF-8 characters is not dependent on the use
of the \fB/8\fP modifier on the pattern. It is recognized always. There may be
any number of hexadecimal digits inside the braces. The result is from one to
-six bytes, encoded according to the UTF-8 rules.
+six bytes, encoded according to the original UTF-8 rules of RFC 2279. This
+allows for values in the range 0 to 0x7FFFFFFF. Note that not all of those are
+valid Unicode code points, or indeed valid UTF-8 characters according to the
+later rules in RFC 3629.
.
.
-.SH "OUTPUT FROM PCRETEST"
+.SH "THE ALTERNATIVE MATCHING FUNCTION"
.rs
.sp
+By default, \fBpcretest\fP uses the standard PCRE matching function,
+\fBpcre_exec()\fP to match each data line. From release 6.0, PCRE supports an
+alternative matching function, \fBpcre_dfa_exec()\fP, which operates in a
+different way, and has some restrictions. The differences between the two
+functions are described in the
+.\" HREF
+\fBpcrematching\fP
+.\"
+documentation.
+.P
+If a data line contains the \eD escape sequence, or if the command line
+contains the \fB-dfa\fP option, the alternative matching function is called.
+This function finds all possible matches at a given point. If, however, the \eF
+escape sequence is present in the data line, it stops after the first match is
+found. This is always the shortest possible match.
+.
+.
+.SH "DEFAULT OUTPUT FROM PCRETEST"
+.rs
+.sp
+This section describes the output when the normal matching function,
+\fBpcre_exec()\fP, is being used.
+.P
When a match succeeds, pcretest outputs the list of captured substrings that
\fBpcre_exec()\fP returns, starting with number 0 for the string that matched
the whole pattern. Otherwise, it outputs "No match" or "Partial match"
when \fBpcre_exec()\fP returns PCRE_ERROR_NOMATCH or PCRE_ERROR_PARTIAL,
respectively, and otherwise the PCRE negative error number. Here is an example
-of an interactive pcretest run.
+of an interactive \fBpcretest\fP run.
.sp
$ pcretest
- PCRE version 5.00 07-Sep-2004
+ PCRE version 7.0 30-Nov-2006
.sp
re> /^abc(\ed+)/
data> abc123
@@ -330,9 +465,9 @@
.sp
If the strings contain any non-printing characters, they are output as \e0x
escapes, or as \ex{...} escapes if the \fB/8\fP modifier was present on the
-pattern. If the pattern has the \fB/+\fP modifier, the output for substring 0
-is followed by the the rest of the subject string, identified by "0+" like
-this:
+pattern. See below for the definition of non-printing characters. If the
+pattern has the \fB/+\fP modifier, the output for substring 0 is followed by
+the rest of the subject string, identified by "0+" like this:
.sp
re> /cat/+
data> cataract
@@ -360,18 +495,75 @@
length (that is, the return from the extraction function) is given in
parentheses after each string for \fB\eC\fP and \fB\eG\fP.
.P
-Note that while patterns can be continued over several lines (a plain ">"
+Note that whereas patterns can be continued over several lines (a plain ">"
prompt is used for continuations), data lines may not. However newlines can be
-included in data by means of the \en escape.
+included in data by means of the \en escape (or \er, \er\en, etc., depending on
+the newline sequence setting).
+.
+.
+.
+.SH "OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION"
+.rs
+.sp
+When the alternative matching function, \fBpcre_dfa_exec()\fP, is used (by
+means of the \eD escape sequence or the \fB-dfa\fP command line option), the
+output consists of a list of all the matches that start at the first point in
+the subject where there is at least one match. For example:
+.sp
+ re> /(tang|tangerine|tan)/
+ data> yellow tangerine\eD
+ 0: tangerine
+ 1: tang
+ 2: tan
+.sp
+(Using the normal matching function on this data finds only "tang".) The
+longest matching string is always given first (and numbered zero).
+.P
+If \fB/g\fP is present on the pattern, the search for further matches resumes
+at the end of the longest match. For example:
+.sp
+ re> /(tang|tangerine|tan)/g
+ data> yellow tangerine and tangy sultana\eD
+ 0: tangerine
+ 1: tang
+ 2: tan
+ 0: tang
+ 1: tan
+ 0: tan
+.sp
+Since the matching function does not support substring capture, the escape
+sequences that are concerned with captured substrings are not relevant.
+.
+.
+.SH "RESTARTING AFTER A PARTIAL MATCH"
+.rs
+.sp
+When the alternative matching function has given the PCRE_ERROR_PARTIAL return,
+indicating that the subject partially matched the pattern, you can restart the
+match with additional subject data by means of the \eR escape sequence. For
+example:
+.sp
+ re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
+ data> 23ja\eP\eD
+ Partial match: 23ja
+ data> n05\eR\eD
+ 0: n05
+.sp
+For further information about partial matching, see the
+.\" HREF
+\fBpcrepartial\fP
+.\"
+documentation.
.
.
.SH CALLOUTS
.rs
.sp
If the pattern contains any callout requests, \fBpcretest\fP's callout function
-is called during matching. By default, it displays the callout number, the
-start and current positions in the text at the callout time, and the next
-pattern item to be tested. For example, the output
+is called during matching. This works with both matching functions. By default,
+the called function displays the callout number, the start and current
+positions in the text at the callout time, and the next pattern item to be
+tested. For example, the output
.sp
--->pqrabcdef
0 ^ ^ \ed
@@ -396,7 +588,7 @@
0: E*
.sp
The callout function in \fBpcretest\fP returns zero (carry on matching) by
-default, but you can use an \eC item in a data line (as described above) to
+default, but you can use a \eC item in a data line (as described above) to
change this.
.P
Inserting callouts can be helpful when using \fBpcretest\fP to check
@@ -408,6 +600,21 @@
documentation.
.
.
+.
+.SH "NON-PRINTING CHARACTERS"
+.rs
+.sp
+When \fBpcretest\fP is outputting text in the compiled version of a pattern,
+bytes other than 32-126 are always treated as non-printing characters and are
+therefore shown as hex escapes.
+.P
+When \fBpcretest\fP is outputting text that is a matched part of a subject
+string, it behaves in the same way, unless a different locale has been set for
+the pattern (using the \fB/L\fP modifier). In this case, the \fBisprint()\fP
+function is used to distinguish printing and non-printing characters.
+.
+.
+.
.SH "SAVING AND RELOADING COMPILED PATTERNS"
.rs
.sp
@@ -468,16 +675,27 @@
result is undefined.
.
.
+.SH "SEE ALSO"
+.rs
+.sp
+\fBpcre\fP(3), \fBpcreapi\fP(3), \fBpcrecallout\fP(3), \fBpcrematching\fP(3),
+\fBpcrepartial\fP(3), \fBpcrepattern\fP(3), \fBpcreprecompile\fP(3).
+.
+.
.SH AUTHOR
.rs
.sp
-Philip Hazel <ph...@cam.ac.uk>
-.br
-University Computing Service,
-.br
-Cambridge CB2 3QG, England.
-.P
-.in 0
-Last updated: 10 September 2004
-.br
-Copyright (c) 1997-2004 University of Cambridge.
+.nf
+Philip Hazel
+University Computing Service
+Cambridge CB2 3QH, England.
+.fi
+.
+.
+.SH REVISION
+.rs
+.sp
+.nf
+Last updated: 11 September 2007
+Copyright (c) 1997-2007 University of Cambridge.
+.fi
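Note on the \x{hh...} paragraph in the man page diff above: it says that a value in the range 0 to 0x7FFFFFFF becomes one to six bytes under the original UTF-8 rules of RFC 2279. The following is a minimal sketch of that encoding, written in Python for illustration only. It is not taken from pcretest, and the function name encode_rfc2279 is a hypothetical one chosen for this note.

```python
def encode_rfc2279(value: int) -> bytes:
    """Encode an integer with the original (RFC 2279) UTF-8 scheme.

    Illustrative helper only; it is not part of PCRE or pcretest.
    Values from 0 to 0x7FFFFFFF are accepted and yield 1 to 6 bytes.
    """
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value must be in the range 0 to 0x7FFFFFFF")
    if value < 0x80:
        return bytes([value])  # plain ASCII, a single byte

    # (exclusive upper bound, total length in bytes, lead-byte marker)
    ranges = (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    )
    for limit, length, lead in ranges:
        if value < limit:
            break

    out = bytearray(length)
    # Fill continuation bytes from the end, six payload bits at a time.
    for i in range(length - 1, 0, -1):
        out[i] = 0x80 | (value & 0x3F)
        value >>= 6
    out[0] = lead | value  # the remaining high bits go into the lead byte
    return bytes(out)


if __name__ == "__main__":
    for v in (0x41, 0xE9, 0x20AC, 0x110000, 0x7FFFFFFF):
        print(hex(v), encode_rfc2279(v).hex())
```

Values above 0x10FFFF, such as 0x110000 in the loop above, still encode under these rules, which is why the paragraph warns that not every result is a valid UTF-8 character under RFC 3629.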
Modified: httpd/httpd/vendor/pcre/current/doc/pcretest.txt
URL: http://svn.apache.org/viewvc/httpd/httpd/vendor/pcre/current/doc/pcretest.txt?rev=598339&r1=598338&r2=598339&view=diff
==============================================================================
--- httpd/httpd/vendor/pcre/current/doc/pcretest.txt (original)
+++ httpd/httpd/vendor/pcre/current/doc/pcretest.txt Mon Nov 26 08:49:53 2007
@@ -1,14 +1,13 @@
PCRETEST(1) PCRETEST(1)
-
NAME
pcretest - a program for testing Perl-compatible regular expressions.
+
SYNOPSIS
- pcretest [-C] [-d] [-i] [-m] [-o osize] [-p] [-t] [source]
- [destination]
+ pcretest [options] [source] [destination]
pcretest was written as a test program for the PCRE regular expression
library itself, but it can also be used for experimenting with regular
@@ -20,99 +19,126 @@
OPTIONS
+ -b Behave as if each regex has the /B (show bytecode) modifier;
+ the internal form is output after compilation.
+
-C Output the version number of the PCRE library, and all avail-
- able information about the optional features that are
+ able information about the optional features that are
included, and then exit.
- -d Behave as if each regex had the /D (debug) modifier; the
- internal form is output after compilation.
+ -d Behave as if each regex has the /D (debug) modifier; the
+ internal form and information about the compiled pattern are
+ output after compilation; -d is equivalent to -b -i.
+
+ -dfa Behave as if each data line contains the \D escape sequence;
+ this causes the alternative matching function,
+ pcre_dfa_exec(), to be used instead of the standard
+ pcre_exec() function (more detail is given below).
- -i Behave as if each regex had the /I modifier; information
+ -help Output a brief summary of these options and then exit.
+
+ -i Behave as if each regex has the /I modifier; information
about the compiled pattern is given after compilation.
- -m Output the size of each compiled pattern after it has been
- compiled. This is equivalent to adding /M to each regular
- expression. For compatibility with earlier versions of
+ -m Output the size of each compiled pattern after it has been
+ compiled. This is equivalent to adding /M to each regular
+ expression. For compatibility with earlier versions of
pcretest, -s is a synonym for -m.
- -o osize Set the number of elements in the output vector that is used
- when calling pcre_exec() to be osize. The default value is
- 45, which is enough for 14 capturing subexpressions. The vec-
- tor size can be changed for individual matching calls by
- including \O in the data line (see below).
-
- -p Behave as if each regex has /P modifier; the POSIX wrapper
- API is used to call PCRE. None of the other options has any
- effect when -p is set.
-
- -t Run each compile, study, and match many times with a timer,
- and output resulting time per compile or match (in millisec-
- onds). Do not set -m with -t, because you will then get the
- size output a zillion times, and the timing will be dis-
- torted.
+ -o osize Set the number of elements in the output vector that is used
+ when calling pcre_exec() or pcre_dfa_exec() to be osize. The
+ default value is 45, which is enough for 14 capturing subex-
+ pressions for pcre_exec() or 22 different matches for
+ pcre_dfa_exec(). The vector size can be changed for individ-
+ ual matching calls by including \O in the data line (see
+ below).
+
+ -p Behave as if each regex has the /P modifier; the POSIX wrap-
+ per API is used to call PCRE. None of the other options has
+ any effect when -p is set.
+
+ -q Do not output the version number of pcretest at the start of
+ execution.
+
+ -S size On Unix-like systems, set the size of the runtime stack to
+ size megabytes.
+
+ -t Run each compile, study, and match many times with a timer,
+ and output resulting time per compile or match (in millisec-
+ onds). Do not set -m with -t, because you will then get the
+ size output a zillion times, and the timing will be dis-
+ torted. You can control the number of iterations that are
+ used for timing by following -t with a number (as a separate
+ item on the command line). For example, "-t 1000" would iter-
+ ate 1000 times. The default is to iterate 500000 times.
+
+ -tm This is like -t except that it times only the matching phase,
+ not the compile or study phases.
DESCRIPTION
- If pcretest is given two filename arguments, it reads from the first
+ If pcretest is given two filename arguments, it reads from the first
and writes to the second. If it is given only one filename argument, it
- reads from that file and writes to stdout. Otherwise, it reads from
- stdin and writes to stdout, and prompts for each line of input, using
+ reads from that file and writes to stdout. Otherwise, it reads from
+ stdin and writes to stdout, and prompts for each line of input, using
"re>" to prompt for regular expressions, and "data>" to prompt for data
lines.
The program handles any number of sets of input on a single input file.
- Each set starts with a regular expression, and continues with any num-
+ Each set starts with a regular expression, and continues with any num-
ber of data lines to be matched against the pattern.
- Each data line is matched separately and independently. If you want to
- do multiple-line matches, you have to use the \n escape sequence in a
- single line of input to encode the newline characters. The maximum
- length of data line is 30,000 characters.
-
- An empty line signals the end of the data lines, at which point a new
- regular expression is read. The regular expressions are given enclosed
- in any non-alphanumeric delimiters other than backslash, for example
+ Each data line is matched separately and independently. If you want to
+ do multi-line matches, you have to use the \n escape sequence (or \r or
+ \r\n, etc., depending on the newline setting) in a single line of input
+ to encode the newline sequences. There is no limit on the length of
+ data lines; the input buffer is automatically extended if it is too
+ small.
+
+ An empty line signals the end of the data lines, at which point a new
+ regular expression is read. The regular expressions are given enclosed
+ in any non-alphanumeric delimiters other than backslash, for example:
/(a|bc)x+yz/
- White space before the initial delimiter is ignored. A regular expres-
- sion may be continued over several input lines, in which case the new-
- line characters are included within it. It is possible to include the
+ White space before the initial delimiter is ignored. A regular expres-
+ sion may be continued over several input lines, in which case the new-
+ line characters are included within it. It is possible to include the
delimiter within the pattern by escaping it, for example
/abc\/def/
- If you do so, the escape and the delimiter form part of the pattern,
- but since delimiters are always non-alphanumeric, this does not affect
- its interpretation. If the terminating delimiter is immediately fol-
+ If you do so, the escape and the delimiter form part of the pattern,
+ but since delimiters are always non-alphanumeric, this does not affect
+ its interpretation. If the terminating delimiter is immediately fol-
lowed by a backslash, for example,
/abc/\
- then a backslash is added to the end of the pattern. This is done to
- provide a way of testing the error condition that arises if a pattern
+ then a backslash is added to the end of the pattern. This is done to
+ provide a way of testing the error condition that arises if a pattern
finishes with a backslash, because
/abc\/
- is interpreted as the first line of a pattern that starts with "abc/",
+ is interpreted as the first line of a pattern that starts with "abc/",
causing pcretest to read the next line as a continuation of the regular
expression.
PATTERN MODIFIERS
- A pattern may be followed by any number of modifiers, which are mostly
- single characters. Following Perl usage, these are referred to below
- as, for example, "the /i modifier", even though the delimiter of the
- pattern need not always be a slash, and no slash is used when writing
- modifiers. Whitespace may appear between the final pattern delimiter
+ A pattern may be followed by any number of modifiers, which are mostly
+ single characters. Following Perl usage, these are referred to below
+ as, for example, "the /i modifier", even though the delimiter of the
+ pattern need not always be a slash, and no slash is used when writing
+ modifiers. Whitespace may appear between the final pattern delimiter
and the first modifier, and between the modifiers themselves.
The /i, /m, /s, and /x modifiers set the PCRE_CASELESS, PCRE_MULTILINE,
- PCRE_DOTALL, or PCRE_EXTENDED options, respectively, when pcre_com-
- pile() is called. These four modifier letters have the same effect as
+ PCRE_DOTALL, or PCRE_EXTENDED options, respectively, when pcre_com-
+ pile() is called. These four modifier letters have the same effect as
they do in Perl. For example:
/caseless/i
@@ -120,12 +146,32 @@
The following table shows additional modifiers for setting PCRE options
that do not correspond to anything in Perl:
- /A PCRE_ANCHORED
- /C PCRE_AUTO_CALLOUT
- /E PCRE_DOLLAR_ENDONLY
- /N PCRE_NO_AUTO_CAPTURE
- /U PCRE_UNGREEDY
- /X PCRE_EXTRA
+ /A PCRE_ANCHORED
+ /C PCRE_AUTO_CALLOUT
+ /E PCRE_DOLLAR_ENDONLY
+ /f PCRE_FIRSTLINE
+ /J PCRE_DUPNAMES
+ /N PCRE_NO_AUTO_CAPTURE
+ /U PCRE_UNGREEDY
+ /X PCRE_EXTRA
+ /<cr> PCRE_NEWLINE_CR
+ /<lf> PCRE_NEWLINE_LF
+ /<crlf> PCRE_NEWLINE_CRLF
+ /<anycrlf> PCRE_NEWLINE_ANYCRLF
+ /<any> PCRE_NEWLINE_ANY
+ /<bsr_anycrlf> PCRE_BSR_ANYCRLF
+ /<bsr_unicode> PCRE_BSR_UNICODE
+
+ Those specifying line ending sequences are literal strings as shown,
+ but the letters can be in either case. This example sets multiline
+ matching with CRLF as the line ending sequence:
+
+ /^abc/m<crlf>
+
+ Details of the meanings of these PCRE options are given in the pcreapi
+ documentation.
+
+ Finding all matches in a string
Searching for all possible matches within each subject string can be
requested by the /g or /G modifier. After finding a match, PCRE is
@@ -144,6 +190,8 @@
one, and the normal match is retried. This imitates the way Perl han-
dles such cases when using the /g modifier or the split() function.
+ Other modifiers
+
There are yet more modifiers for controlling the way pcretest operates.
The /+ modifier requests that as well as outputting the substring that
@@ -151,83 +199,92 @@
remainder of the subject string. This is useful for tests where the
subject contains multiple copies of the same substring.
- The /L modifier must be followed directly by the name of a locale, for
+ The /B modifier is a debugging feature. It requests that pcretest out-
+ put a representation of the compiled byte code after compilation. Nor-
+ mally this information contains length and offset values; however, if
+ /Z is also present, this data is replaced by spaces. This is a special
+ feature for use in the automatic test scripts; it ensures that the same
+ output is generated for different internal link sizes.
+
+ The /L modifier must be followed directly by the name of a locale, for
example,
/pattern/Lfr_FR
For this reason, it must be the last modifier. The given locale is set,
- pcre_maketables() is called to build a set of character tables for the
- locale, and this is then passed to pcre_compile() when compiling the
- regular expression. Without an /L modifier, NULL is passed as the
- tables pointer; that is, /L applies only to the expression on which it
+ pcre_maketables() is called to build a set of character tables for the
+ locale, and this is then passed to pcre_compile() when compiling the
+ regular expression. Without an /L modifier, NULL is passed as the
+ tables pointer; that is, /L applies only to the expression on which it
appears.
- The /I modifier requests that pcretest output information about the
- compiled pattern (whether it is anchored, has a fixed first character,
- and so on). It does this by calling pcre_fullinfo() after compiling a
- pattern. If the pattern is studied, the results of that are also out-
+ The /I modifier requests that pcretest output information about the
+ compiled pattern (whether it is anchored, has a fixed first character,
+ and so on). It does this by calling pcre_fullinfo() after compiling a
+ pattern. If the pattern is studied, the results of that are also out-
put.
- The /D modifier is a PCRE debugging feature, which also assumes /I. It
- causes the internal form of compiled regular expressions to be output
- after compilation. If the pattern was studied, the information returned
- is also output.
+ The /D modifier is a PCRE debugging feature, and is equivalent to /BI,
+ that is, both the /B and the /I modifiers.
The /F modifier causes pcretest to flip the byte order of the fields in
- the compiled pattern that contain 2-byte and 4-byte numbers. This
- facility is for testing the feature in PCRE that allows it to execute
+ the compiled pattern that contain 2-byte and 4-byte numbers. This
+ facility is for testing the feature in PCRE that allows it to execute
patterns that were compiled on a host with a different endianness. This
- feature is not available when the POSIX interface to PCRE is being
- used, that is, when the /P pattern modifier is specified. See also the
+ feature is not available when the POSIX interface to PCRE is being
+ used, that is, when the /P pattern modifier is specified. See also the
section about saving and reloading compiled patterns below.
- The /S modifier causes pcre_study() to be called after the expression
+ The /S modifier causes pcre_study() to be called after the expression
has been compiled, and the results used when the expression is matched.
- The /M modifier causes the size of memory block used to hold the com-
+ The /M modifier causes the size of memory block used to hold the com-
piled pattern to be output.
- The /P modifier causes pcretest to call PCRE via the POSIX wrapper API
- rather than its native API. When this is done, all other modifiers
- except /i, /m, and /+ are ignored. REG_ICASE is set if /i is present,
- and REG_NEWLINE is set if /m is present. The wrapper functions force
- PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless REG_NEWLINE is set.
-
- The /8 modifier causes pcretest to call PCRE with the PCRE_UTF8 option
- set. This turns on support for UTF-8 character handling in PCRE, pro-
- vided that it was compiled with this support enabled. This modifier
+ The /P modifier causes pcretest to call PCRE via the POSIX wrapper API
+ rather than its native API. When this is done, all other modifiers
+ except /i, /m, and /+ are ignored. REG_ICASE is set if /i is present,
+ and REG_NEWLINE is set if /m is present. The wrapper functions force
+ PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless REG_NEWLINE is set.
+
+ The /8 modifier causes pcretest to call PCRE with the PCRE_UTF8 option
+ set. This turns on support for UTF-8 character handling in PCRE, pro-
+ vided that it was compiled with this support enabled. This modifier
also causes any non-printing characters in output strings to be printed
using the \x{hh...} notation if they are valid UTF-8 sequences.
- If the /? modifier is used with /8, it causes pcretest to call
- pcre_compile() with the PCRE_NO_UTF8_CHECK option, to suppress the
+ If the /? modifier is used with /8, it causes pcretest to call
+ pcre_compile() with the PCRE_NO_UTF8_CHECK option, to suppress the
checking of the string for UTF-8 validity.
DATA LINES
- Before each data line is passed to pcre_exec(), leading and trailing
- whitespace is removed, and it is then scanned for \ escapes. Some of
- these are pretty esoteric features, intended for checking out some of
- the more complicated features of PCRE. If you are just testing "ordi-
- nary" regular expressions, you probably don't need any of these. The
+ Before each data line is passed to pcre_exec(), leading and trailing
+ whitespace is removed, and it is then scanned for \ escapes. Some of
+ these are pretty esoteric features, intended for checking out some of
+ the more complicated features of PCRE. If you are just testing "ordi-
+ nary" regular expressions, you probably don't need any of these. The
following escapes are recognized:
- \a alarm (= BEL)
- \b backspace
- \e escape
- \f formfeed
- \n newline
- \r carriage return
- \t tab
- \v vertical tab
+ \a alarm (BEL, \x07)
+ \b backspace (\x08)
+ \e escape (\x1b)
+ \f formfeed (\x0c)
+ \n newline (\x0a)
+ \qdd set the PCRE_MATCH_LIMIT limit to dd
+ (any number of digits)
+ \r carriage return (\x0d)
+ \t tab (\x09)
+ \v vertical tab (\x0b)
\nnn octal character (up to 3 octal digits)
\xhh hexadecimal character (up to 2 hex digits)
\x{hh...} hexadecimal character, any number of digits
in UTF-8 mode
\A pass the PCRE_ANCHORED option to pcre_exec()
+ or pcre_dfa_exec()
\B pass the PCRE_NOTBOL option to pcre_exec()
+ or pcre_dfa_exec()
\Cdd call pcre_copy_substring() for substring dd
after a successful match (number less than 32)
\Cname call pcre_copy_named_substring() for substring
@@ -242,6 +299,8 @@
reached for the nth time
\C*n pass the number n (may be negative) as callout
data; this is used as the callout return value
+ \D use the pcre_dfa_exec() match function
+ \F only shortest match for pcre_dfa_exec()
\Gdd call pcre_get_substring() for substring dd
after a successful match (number less than 32)
\Gname call pcre_get_named_substring() for substring
@@ -249,57 +308,105 @@
ated by next non-alphanumeric character)
\L call pcre_get_substringlist() after a
successful match
- \M discover the minimum MATCH_LIMIT setting
+ \M discover the minimum MATCH_LIMIT and
+ MATCH_LIMIT_RECURSION settings
\N pass the PCRE_NOTEMPTY option to pcre_exec()
+ or pcre_dfa_exec()
\Odd set the size of the output vector passed to
pcre_exec() to dd (any number of digits)
\P pass the PCRE_PARTIAL option to pcre_exec()
+ or pcre_dfa_exec()
+ \Qdd set the PCRE_MATCH_LIMIT_RECURSION limit to dd
+ (any number of digits)
+ \R pass the PCRE_DFA_RESTART option to pcre_dfa_exec()
\S output details of memory get/free calls during matching
\Z pass the PCRE_NOTEOL option to pcre_exec()
+ or pcre_dfa_exec()
\? pass the PCRE_NO_UTF8_CHECK option to
- pcre_exec()
+ pcre_exec() or pcre_dfa_exec()
\>dd start the match at offset dd (any number of digits);
this sets the startoffset argument for pcre_exec()
-
- A backslash followed by anything else just escapes the anything else.
- If the very last character is a backslash, it is ignored. This gives a
- way of passing an empty line as data, since a real empty line termi-
+ or pcre_dfa_exec()
+ \<cr> pass the PCRE_NEWLINE_CR option to pcre_exec()
+ or pcre_dfa_exec()
+ \<lf> pass the PCRE_NEWLINE_LF option to pcre_exec()
+ or pcre_dfa_exec()
+ \<crlf> pass the PCRE_NEWLINE_CRLF option to pcre_exec()
+ or pcre_dfa_exec()
+ \<anycrlf> pass the PCRE_NEWLINE_ANYCRLF option to pcre_exec()
+ or pcre_dfa_exec()
+ \<any> pass the PCRE_NEWLINE_ANY option to pcre_exec()
+ or pcre_dfa_exec()
+
+ The escapes that specify line ending sequences are literal strings,
+ exactly as shown. No more than one newline setting should be present in
+ any data line.
+
+ A backslash followed by anything else just escapes the anything else.
+ If the very last character is a backslash, it is ignored. This gives a
+ way of passing an empty line as data, since a real empty line termi-
nates the data input.
- If \M is present, pcretest calls pcre_exec() several times, with dif-
- ferent values in the match_limit field of the pcre_extra data struc-
- ture, until it finds the minimum number that is needed for pcre_exec()
- to complete. This number is a measure of the amount of recursion and
- backtracking that takes place, and checking it out can be instructive.
- For most simple matches, the number is quite small, but for patterns
- with very large numbers of matching possibilities, it can become large
- very quickly with increasing length of subject string.
+ If \M is present, pcretest calls pcre_exec() several times, with dif-
+ ferent values in the match_limit and match_limit_recursion fields of
+ the pcre_extra data structure, until it finds the minimum numbers for
+ each parameter that allow pcre_exec() to complete. The match_limit num-
+ ber is a measure of the amount of backtracking that takes place, and
+ checking it out can be instructive. For most simple matches, the number
+ is quite small, but for patterns with very large numbers of matching
+ possibilities, it can become large very quickly with increasing length
+ of subject string. The match_limit_recursion number is a measure of how
+ much stack (or, if PCRE is compiled with NO_RECURSE, how much heap)
+ memory is needed to complete the match attempt.
When \O is used, the value specified may be higher or lower than the
size set by the -o command line option (or defaulted to 45); \O applies
only to the call of pcre_exec() for the line in which it appears.
If the /P modifier was present on the pattern, causing the POSIX wrap-
- per API to be used, only \B and \Z have any effect, causing REG_NOTBOL
- and REG_NOTEOL to be passed to regexec() respectively.
+ per API to be used, the only option-setting sequences that have any
+ effect are \B and \Z, causing REG_NOTBOL and REG_NOTEOL, respectively,
+ to be passed to regexec().
+
+ The use of \x{hh...} to represent UTF-8 characters is not dependent on
+ the use of the /8 modifier on the pattern. It is recognized always.
+ There may be any number of hexadecimal digits inside the braces. The
+ result is from one to six bytes, encoded according to the original
+ UTF-8 rules of RFC 2279. This allows for values in the range 0 to
+ 0x7FFFFFFF. Note that not all of those are valid Unicode code points,
+ or indeed valid UTF-8 characters according to the later rules in RFC
+ 3629.
+
+
+THE ALTERNATIVE MATCHING FUNCTION
+
+ By default, pcretest uses the standard PCRE matching function,
+ pcre_exec(), to match each data line. From release 6.0, PCRE supports an
+ alternative matching function, pcre_dfa_exec(), which operates in a
+ different way, and has some restrictions. The differences between the
+ two functions are described in the pcrematching documentation.
+
+ If a data line contains the \D escape sequence, or if the command line
+ contains the -dfa option, the alternative matching function is called.
+ This function finds all possible matches at a given point. If, however,
+ the \F escape sequence is present in the data line, it stops after the
+ first match is found. This is always the shortest possible match.
- The use of \x{hh...} to represent UTF-8 characters is not dependent on
- the use of the /8 modifier on the pattern. It is recognized always.
- There may be any number of hexadecimal digits inside the braces. The
- result is from one to six bytes, encoded according to the UTF-8 rules.
+DEFAULT OUTPUT FROM PCRETEST
-OUTPUT FROM PCRETEST
+ This section describes the output when the normal matching function,
+ pcre_exec(), is being used.
When a match succeeds, pcretest outputs the list of captured substrings
- that pcre_exec() returns, starting with number 0 for the string that
+ that pcre_exec() returns, starting with number 0 for the string that
matched the whole pattern. Otherwise, it outputs "No match" or "Partial
- match" when pcre_exec() returns PCRE_ERROR_NOMATCH or PCRE_ERROR_PAR-
- TIAL, respectively, and otherwise the PCRE negative error number. Here
+ match" when pcre_exec() returns PCRE_ERROR_NOMATCH or PCRE_ERROR_PAR-
+ TIAL, respectively, and otherwise the PCRE negative error number. Here
is an example of an interactive pcretest run.
$ pcretest
- PCRE version 5.00 07-Sep-2004
+ PCRE version 7.0 30-Nov-2006
re> /^abc(\d+)/
data> abc123
@@ -308,11 +415,12 @@
data> xyz
No match
- If the strings contain any non-printing characters, they are output as
- \0x escapes, or as \x{...} escapes if the /8 modifier was present on
- the pattern. If the pattern has the /+ modifier, the output for sub-
- string 0 is followed by the the rest of the subject string, identified
- by "0+" like this:
+ If the strings contain any non-printing characters, they are output as
+ \0x escapes, or as \x{...} escapes if the /8 modifier was present on
+ the pattern. See below for the definition of non-printing characters.
+ If the pattern has the /+ modifier, the output for substring 0 is
+ followed by the rest of the subject string, identified by "0+" like
+ this:
re> /cat/+
data> cataract
@@ -340,17 +448,69 @@
(that is, the return from the extraction function) is given in paren-
theses after each string for \C and \G.
- Note that while patterns can be continued over several lines (a plain
+ Note that whereas patterns can be continued over several lines (a plain
">" prompt is used for continuations), data lines may not. However new-
- lines can be included in data by means of the \n escape.
+ lines can be included in data by means of the \n escape (or \r, \r\n,
+ etc., depending on the newline sequence setting).
+
+
+OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
+
+ When the alternative matching function, pcre_dfa_exec(), is used (by
+ means of the \D escape sequence or the -dfa command line option), the
+ output consists of a list of all the matches that start at the first
+ point in the subject where there is at least one match. For example:
+
+ re> /(tang|tangerine|tan)/
+ data> yellow tangerine\D
+ 0: tangerine
+ 1: tang
+ 2: tan
+
+ (Using the normal matching function on this data finds only "tang".)
+ The longest matching string is always given first (and numbered zero).
+
+ If /g is present on the pattern, the search for further matches resumes
+ at the end of the longest match. For example:
+
+ re> /(tang|tangerine|tan)/g
+ data> yellow tangerine and tangy sultana\D
+ 0: tangerine
+ 1: tang
+ 2: tan
+ 0: tang
+ 1: tan
+ 0: tan
+
+ Since the matching function does not support substring capture, the
+ escape sequences that are concerned with captured substrings are not
+ relevant.
+
+
+RESTARTING AFTER A PARTIAL MATCH
+
+ When the alternative matching function has given the PCRE_ERROR_PARTIAL
+ return, indicating that the subject partially matched the pattern, you
+ can restart the match with additional subject data by means of the \R
+ escape sequence. For example:
+
+ re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
+ data> 23ja\P\D
+ Partial match: 23ja
+ data> n05\R\D
+ 0: n05
+
+ For further information about partial matching, see the pcrepartial
+ documentation.
CALLOUTS
- If the pattern contains any callout requests, pcretest's callout func-
- tion is called during matching. By default, it displays the callout
- number, the start and current positions in the text at the callout
- time, and the next pattern item to be tested. For example, the output
+ If the pattern contains any callout requests, pcretest's callout func-
+ tion is called during matching. This works with both matching func-
+ tions. By default, the called function displays the callout number, the
+ start and current positions in the text at the callout time, and the
+ next pattern item to be tested. For example, the output
--->pqrabcdef
0 ^ ^ \d
@@ -376,7 +536,7 @@
0: E*
The callout function in pcretest returns zero (carry on matching) by
- default, but you can use an \C item in a data line (as described above)
+ default, but you can use a \C item in a data line (as described above)
to change this.
Inserting callouts can be helpful when using pcretest to check compli-
@@ -384,6 +544,18 @@
the pcrecallout documentation.
+NON-PRINTING CHARACTERS
+
+ When pcretest is outputting text in the compiled version of a pattern,
+ bytes other than 32-126 are always treated as non-printing characters
+ and are therefore shown as hex escapes.
+
+ When pcretest is outputting text that is a matched part of a subject
+ string, it behaves in the same way, unless a different locale has been
+ set for the pattern (using the /L modifier). In this case, the
+ isprint() function is used to distinguish printing and non-printing
+ characters.
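
The default rule (no locale set) can be sketched in a few lines of Python. This is an illustration only: the function name show_bytes is hypothetical, and the two-digit lowercase "\x" escape format is an assumption, since the text above does not spell out the exact form of the hex escapes.

```python
def show_bytes(data: bytes) -> str:
    """Render bytes 32-126 as themselves and all other bytes as hex escapes."""
    out = []
    for b in data:
        if 32 <= b <= 126:
            out.append(chr(b))
        else:
            # Assumed escape format: backslash, x, two lowercase hex digits.
            out.append("\\x%02x" % b)
    return "".join(out)


if __name__ == "__main__":
    print(show_bytes(b"tab\there, newline\n, high byte \xe9"))
```

The locale case would replace the fixed 32-126 range test with a call to the C library's isprint(), which this sketch does not attempt.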
+
+
SAVING AND RELOADING COMPILED PATTERNS
The facilities described in this section are not available when the
@@ -440,11 +612,20 @@
a file that is not in the correct format, the result is undefined.
+SEE ALSO
+
+ pcre(3), pcreapi(3), pcrecallout(3), pcrematching(3), pcrepartial(3),
+ pcrepattern(3), pcreprecompile(3).
+
+
AUTHOR
- Philip Hazel <ph...@cam.ac.uk>
- University Computing Service,
- Cambridge CB2 3QG, England.
+ Philip Hazel
+ University Computing Service
+ Cambridge CB2 3QH, England.
+
+
+REVISION
-Last updated: 10 September 2004
-Copyright (c) 1997-2004 University of Cambridge.
+ Last updated: 11 September 2007
+ Copyright (c) 1997-2007 University of Cambridge.
Modified: httpd/httpd/vendor/pcre/current/doc/perltest.txt
URL: http://svn.apache.org/viewvc/httpd/httpd/vendor/pcre/current/doc/perltest.txt?rev=598339&r1=598338&r2=598339&view=diff
==============================================================================
--- httpd/httpd/vendor/pcre/current/doc/perltest.txt (original)
+++ httpd/httpd/vendor/pcre/current/doc/perltest.txt Mon Nov 26 08:49:53 2007
@@ -29,5 +29,5 @@
test some features of PCRE. Some of these files also contain malformed regular
expressions, in order to check that PCRE diagnoses them correctly.
-Philip Hazel <ph...@cam.ac.uk>
+Philip Hazel
September 2004