Posted to oro-dev@jakarta.apache.org by df...@apache.org on 2001/05/17 21:00:58 UTC

cvs commit: jakarta-oro/src/java/org/apache/oro/text/regex package.html

dfs         01/05/17 12:00:58

  Modified:    src/java/org/apache/oro/text/regex package.html
  Log:
  Added description of supported Perl5 regular expression syntax from the old
  OROMatcher user's guide.  This will be moved into a new user's guide.
  
  Revision  Changes    Path
  1.2       +130 -1    jakarta-oro/src/java/org/apache/oro/text/regex/package.html
  
  Index: package.html
  ===================================================================
  RCS file: /home/cvs/jakarta-oro/src/java/org/apache/oro/text/regex/package.html,v
  retrieving revision 1.1
  retrieving revision 1.2
  diff -u -r1.1 -r1.2
  --- package.html	2000/07/23 23:08:54	1.1
  +++ package.html	2001/05/17 19:00:55	1.2
  @@ -1,7 +1,136 @@
  -<!-- $Id: package.html,v 1.1 2000/07/23 23:08:54 jon Exp $ -->
  +<!-- $Id: package.html,v 1.2 2001/05/17 19:00:55 dfs Exp $ -->
   <body>
   This package used to be the OROMatcher library and provides both
   generic regular expression interfaces and Perl5 regular expression
   compatible implementation classes.
  +
  +<p>
  +<em>Note: The following information will be moved into the user's guide.</em>
  +</p>
  +
  +<h1> Perl5 regular expressions </h1>
  +
  +<p>
  +Here we summarize the syntax of Perl5.003 regular expressions, all of
  +which is supported by the Perl5 classes in this package. However, for
  +a definitive reference, you should consult the 
  +<a href="http://www.perl.org/CPAN/doc/manual/html/pod/perlre.html">
  +<code>perlre</code> man page </a>
  +that accompanies the Perl5 distribution and also the book
  +<em> Programming Perl, 2nd Edition </em> from O'Reilly &amp; Associates.
  +We are working toward implementing the features added after Perl5.003
  +up to and including Perl 5.6.  Please remember, we only guarantee
  +support for Perl5.003 expressions in version 2.0.
  +
  +<p>
  +<ul>
  +<li> Alternatives separated by |
  +<li> Quantified atoms
  + <dl compact>
  +      <dt> {n,m} <dd> Match at least n but not more than m times.
  +      <dt> {n,}  <dd> Match at least n times.
  +      <dt> {n}   <dd> Match exactly n times.  
  +      <dt> *     <dd> Match 0 or more times.
  +      <dt> +     <dd> Match 1 or more times.
  +      <dt> ?     <dd> Match 0 or 1 times.
  + </dl>
  + <li> Atoms
  + <ul>
  +     <li> regular expression within parentheses
  +     <li> a . matches any character except \n
  +     <li> a ^ is a null token matching the beginning of a string or line
  +          (i.e., the position right after a newline or right before
  +           the beginning of a string)
  +     <li> a $ is a null token matching the end of a string or line
  +          (i.e., the position right before a newline or right after
  +           the end of a string)
  +     <li> Character classes (e.g., [abcd]) and ranges (e.g., [a-z])
  +     <ul>
  +         <li> Special backslashed characters work within a character
  +              class (except for backreferences and boundaries).  
  +         <li> \b is backspace inside a character class
  +     </ul>
  +     <li> Special backslashed characters
  +     <dl compact>
  +         <dt> \b <dd> null token matching a word boundary (\w on one side
  +                      and \W on the other)
  +         <dt> \B <dd> null token matching a boundary that isn't a
  +                      word boundary
  +         <dt> \A <dd> Match only at beginning of string
  +         <dt> \Z <dd> Match only at end of string (or before newline
  +                      at the end)
  +         <dt> \n <dd> newline
  +         <dt> \r <dd> carriage return
  +         <dt> \t <dd> tab
  +         <dt> \f <dd> formfeed
  +         <dt> \d <dd> digit [0-9]
  +         <dt> \D <dd> non-digit [^0-9]
  +         <dt> \w <dd> word character [0-9a-z_A-Z]
  +         <dt> \W <dd> a non-word character [^0-9a-z_A-Z]
  +         <dt> \s <dd> a whitespace character [ \t\n\r\f]
  +         <dt> \S <dd> a non-whitespace character [^ \t\n\r\f]
  +         <dt> \xnn <dd> hexadecimal representation of character
  +         <dt> \cD <dd> matches the corresponding control character
  +         <dt> \nn or \nnn <dd> octal representation of a character,
  +                               unless it is a backreference
  +         <dt> \1, \2, \3, etc. <dd> match whatever the first, second,
  +          third, etc. parenthesized group matched.  This is called a
  +          backreference.  If there is no corresponding group, the
  +          number is interpreted as an octal representation of a character.
  +         <dt> \0 <dd> matches null character
  +         <dt> Any other backslashed character matches itself
  +     </dl>
  + </ul>
  + <li> Expressions within parentheses are matched as subpattern groups
  +      and saved for use by certain methods.
  + </ul>
  +
  +<p>
  +By default, a quantified subpattern is <em> greedy </em>.
  +In other words, it matches as many times as possible without causing
  +the rest of the pattern not to match. To change the quantifiers
  +to match the minimum number of times possible, without
  +causing the rest of the pattern not to match, you may use
  +a "?" right after the quantifier.
  +
  +<dl compact>
  +<dt> *?     <dd> Match 0 or more times
  +<dt> +?     <dd> Match 1 or more times
  +<dt> ??     <dd> Match 0 or 1 time
  +<dt> {n}?   <dd> Match exactly n times
  +<dt> {n,}?  <dd> Match at least n times
  +<dt> {n,m}? <dd> Match at least n but not more than m times
  +</dl>
  +
  +<p>
  +<b> Perl5 extended regular expressions </b> are fully supported.
  +
  +<dl compact>
  +<dt> (?#text) <dd> An embedded comment causing text to be ignored.
  +<dt> (?:regexp) <dd> Groups things like "()" but doesn't cause the
  + group match to be saved.
  +<dt> (?=regexp) <dd>
  +                 A zero-width positive lookahead assertion.  For
  +                 example, \w+(?=\s) matches a word followed by
  +                 whitespace, without including whitespace in the
  +                 MatchResult.
  +
  +<dt> (?!regexp) <dd>
  +                 A zero-width negative lookahead assertion.  For
  +                 example, foo(?!bar) matches any occurrence of
  +                 "foo" that isn't followed by "bar".  Remember
  +                 that this is a zero-width assertion, which means
  +                 that a(?!b)d will match ad because a is followed
  +                 by a character that is not b (the d) and a d
  +                 follows the zero-width assertion.
  +
  +
  +<dt> (?imsx) <dd> One or more embedded pattern-match modifiers.
  +                i enables case insensitivity, m enables multiline
  +                treatment of the input, s enables single line treatment
  +                of the input, and x permits whitespace and comments in the pattern.
  +</dl>
  +
  +
   </body>
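
The constructs summarized in the added documentation (greedy and minimal
quantifiers, lookahead assertions, backreferences, and embedded modifiers)
can be tried out with a short sketch. Note the assumptions: this sketch does
not use the classes in org.apache.oro.text.regex. It uses java.util.regex
from the Java standard library as a stand-in, because it accepts the same
syntax for the patterns shown here. The class name Perl5SyntaxSketch and
its helper methods are hypothetical and exist only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of jakarta-oro.
class Perl5SyntaxSketch {

    // Returns the first substring of input matched by regex, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Returns the start offset of the first match, or -1 if there is none.
    static int firstMatchStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String html = "<b>bold</b>";

        // Greedy: + takes as many characters as the rest of the pattern allows.
        System.out.println(firstMatch("<.+>", html));

        // Minimal: +? takes as few characters as the rest of the pattern allows.
        System.out.println(firstMatch("<.+?>", html));

        // Positive lookahead: the whitespace is required but not included.
        System.out.println(firstMatch("\\w+(?=\\s)", "hello world"));

        // Negative lookahead: skips the "foo" that is followed by "bar".
        System.out.println(firstMatchStart("foo(?!bar)", "foobar foobaz"));

        // Zero-width assertion: a(?!b)d still matches "ad".
        System.out.println(firstMatch("a(?!b)d", "ad"));

        // Backreference: \1 must repeat whatever group 1 matched.
        System.out.println(firstMatch("(\\w)\\1", "letter"));

        // Embedded modifier: (?i) enables case-insensitive matching.
        System.out.println(firstMatch("(?i)perl", "PERL5"));
    }
}
```

The two helpers keep each syntax feature to a single line, so the greedy and
minimal forms of the same pattern can be compared side by side.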