Posted to oro-dev@jakarta.apache.org by df...@apache.org on 2001/05/17 21:00:58 UTC
cvs commit: jakarta-oro/src/java/org/apache/oro/text/regex package.html
dfs 01/05/17 12:00:58
Modified: src/java/org/apache/oro/text/regex package.html
Log:
Added description of supported Perl5 regular expression syntax from the old
OROMatcher user's guide. This will be moved into a new user's guide.
Revision Changes Path
1.2 +130 -1 jakarta-oro/src/java/org/apache/oro/text/regex/package.html
Index: package.html
===================================================================
RCS file: /home/cvs/jakarta-oro/src/java/org/apache/oro/text/regex/package.html,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -r1.1 -r1.2
--- package.html 2000/07/23 23:08:54 1.1
+++ package.html 2001/05/17 19:00:55 1.2
@@ -1,7 +1,136 @@
-<!-- $Id: package.html,v 1.1 2000/07/23 23:08:54 jon Exp $ -->
+<!-- $Id: package.html,v 1.2 2001/05/17 19:00:55 dfs Exp $ -->
<body>
This package used to be the OROMatcher library and provides both
generic regular expression interfaces and Perl5 regular expression
compatible implementation classes.
+
+<p>
+<em>Note: The following information will be moved into the user's guide.</em>
+</p>
+
+<h1> Perl5 regular expressions </h1>
+<p>
+Here we summarize the syntax of Perl5.003 regular expressions, all of
+which is supported by the Perl5 classes in this package. However, for
+a definitive reference, you should consult the
+<a href="http://www.perl.org/CPAN/doc/manual/html/pod/perlre.html">
+<code>perlre</code> man page </a>
+that accompanies the Perl5 distribution and also the book
+<em> Programming Perl, 2nd Edition </em> from O'Reilly &amp; Associates.
+We are working toward implementing the features added after Perl5.003
+up to and including Perl 5.6. Please remember, we only guarantee
+support for Perl5.003 expressions in version 2.0.
+
+<p>
+<ul>
+<li> Alternatives separated by |
+<li> Quantified atoms
+ <dl compact>
+ <dt> {n,m} <dd> Match at least n but not more than m times.
+ <dt> {n,} <dd> Match at least n times.
+ <dt> {n} <dd> Match exactly n times.
+ <dt> * <dd> Match 0 or more times.
+ <dt> + <dd> Match 1 or more times.
+ <dt> ? <dd> Match 0 or 1 times.
+ </dl>
+ <li> Atoms
+ <ul>
+ <li> regular expression within parentheses
+ <li> a . matches any character except \n
+ <li> a ^ is a null token matching the beginning of a string or line
+ (i.e., the position right after a newline or right before
+ the beginning of a string)
+ <li> a $ is a null token matching the end of a string or line
+ (i.e., the position right before a newline or right after
+ the end of a string)
+ <li> Character classes (e.g., [abcd]) and ranges (e.g. [a-z])
+ <ul>
+ <li> Special backslashed characters work within a character
+ class (except for backreferences and boundaries).
+ <li> \b is backspace inside a character class
+ </ul>
+ <li> Special backslashed characters
+ <dl compact>
+ <dt> \b <dd> null token matching a word boundary (\w on one side
+ and \W on the other)
+ <dt> \B <dd> null token matching a boundary that isn't a
+ word boundary
+ <dt> \A <dd> Match only at beginning of string
+ <dt> \Z <dd> Match only at end of string (or before newline
+ at the end)
+ <dt> \n <dd> newline
+ <dt> \r <dd> carriage return
+ <dt> \t <dd> tab
+ <dt> \f <dd> formfeed
+ <dt> \d <dd> digit [0-9]
+ <dt> \D <dd> non-digit [^0-9]
+ <dt> \w <dd> word character [0-9a-z_A-Z]
+ <dt> \W <dd> a non-word character [^0-9a-z_A-Z]
+ <dt> \s <dd> a whitespace character [ \t\n\r\f]
+ <dt> \S <dd> a non-whitespace character [^ \t\n\r\f]
+ <dt> \xnn <dd> hexadecimal representation of character
+ <dt> \cD <dd> matches the corresponding control character
+ <dt> \nn or \nnn <dd> octal representation of a character,
+ unless it is interpreted as a backreference.
+ <dt> \1, \2, \3, etc. <dd> match whatever the first, second,
+ third, etc. parenthesized group matched. This is called a
+ backreference. If there is no corresponding group, the
+ number is interpreted as an octal representation of a character.
+ <dt> \0 <dd> matches null character
+ <dt> Any other backslashed character matches itself
+ </dl>
+ </ul>
+ <li> Expressions within parentheses are matched as subpattern groups
+ and saved for use by certain methods.
+ </ul>
+
+<p>
+By default, a quantified subpattern is <em> greedy </em>.
+In other words, it matches as many times as possible while still
+allowing the rest of the pattern to match. To make a quantifier
+match the minimum number of times possible, while still
+allowing the rest of the pattern to match, place
+a "?" right after the quantifier.
+
+<dl compact>
+<dt> *? <dd> Match 0 or more times
+<dt> +? <dd> Match 1 or more times
+<dt> ?? <dd> Match 0 or 1 times
+<dt> {n}? <dd> Match exactly n times
+<dt> {n,}? <dd> Match at least n times
+<dt> {n,m}? <dd> Match at least n but not more than m times
+</dl>
+
+<p>
+<b> Perl5 extended regular expressions </b> are fully supported.
+
+<dl compact>
+<dt> (?#text) <dd> An embedded comment causing text to be ignored.
+<dt> (?:regexp) <dd> Groups things like "()" but doesn't cause the
+ group match to be saved.
+<dt> (?=regexp) <dd>
+ A zero-width positive lookahead assertion. For
+ example, \w+(?=\s) matches a word followed by
+ whitespace, without including whitespace in the
+ MatchResult.
+
+<dt> (?!regexp) <dd>
+ A zero-width negative lookahead assertion. For
+ example, foo(?!bar) matches any occurrence of
+ "foo" that isn't followed by "bar". Remember
+ that this is a zero-width assertion, which means
+ that a(?!b)d will match "ad": the a is followed
+ by a character that is not b (the d), and the d
+ is then matched after the zero-width assertion.
+
+
+<dt> (?imsx) <dd> One or more embedded pattern-match modifiers.
+ i enables case insensitivity, m enables multiline
+ treatment of the input, s enables single line treatment
+ of the input, and x enables extended syntax, in which
+ whitespace is ignored and comments are allowed.
+</dl>
+
+
</body>
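
The greedy/minimal quantifier and lookahead rules described in the diff above can be tried out with a short program. The following is only a sketch: it uses the JDK's java.util.regex classes as a stand-in, not the Perl5 classes in org.apache.oro.text.regex, and it assumes the two engines agree on these particular constructs. The class name RegexSketch and the helpers firstMatch and firstMatchStart are hypothetical names made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: java.util.regex is used here as a stand-in engine.
// It is NOT the org.apache.oro.text.regex API described in the commit.
class RegexSketch {

    // Hypothetical helper: text of the first match of regex in input, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Hypothetical helper: start offset of the first match, or -1.
    static int firstMatchStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String html = "<b>bold</b>";

        // Greedy: .* consumes as much as possible while still letting ">" match.
        System.out.println(firstMatch("<.*>", html));

        // Minimal: .*? stops at the first ">" that lets the pattern succeed.
        System.out.println(firstMatch("<.*?>", html));

        // Positive lookahead: the whitespace is required but not part of the match.
        System.out.println(firstMatch("\\w+(?=\\s)", "hello world"));

        // Negative lookahead: the first "foo" is followed by "bar", so it is skipped.
        System.out.println(firstMatchStart("foo(?!bar)", "foobar foobaz"));

        // Zero-width assertion: a(?!b)d matches "ad".
        System.out.println(firstMatch("a(?!b)d", "ad"));

        // Backreference: \1 must repeat whatever group 1 matched.
        System.out.println(firstMatch("(\\w)\\1", "hello"));

        // Embedded modifier: (?i) turns on case-insensitive matching.
        System.out.println(firstMatch("(?i)perl", "PERL5"));
    }
}
```

With the ORO library itself, the same patterns would be compiled and matched through its own compiler and matcher classes; consult the package's API documentation for the exact calls.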