Posted to oro-dev@jakarta.apache.org by Takashi Okamoto <to...@rd.nttdata.co.jp> on 2001/04/01 08:57:50 UTC
POSIX expression
Hi!!
I investigated how Perl 5.6 handles [:punct:], [:graph:] and [:print:] in more detail.
(ORO's goal is compatibility with Perl 5.6.)
At the moment, the characters ORO matches differ slightly from Perl 5.6.
I compared Perl 5.6's matching results against Java's Unicode general categories
(the values returned by Character.getType).
[:print:]
- The following Unicode categories do NOT match Perl 5.6's [:print:]:
Character.CONTROL
Character.FORMAT
Character.SURROGATE
Character.PRIVATE_USE
All other categories match.
- Some Character.UNASSIGNED characters also match [:print:], but
we can ignore them because they are not assigned in Unicode.
(Such characters are used only for special purposes.)
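
As a minimal sketch of the [:print:] rule above (this is not the ORO code itself; the class name PosixPrintSketch and the method isPrint are made up for illustration), the check could be written against Character.getType like this:

```java
// Illustration only: a standalone version of the [:print:] rule described above.
// PosixPrintSketch and isPrint are hypothetical names, not part of ORO.
final class PosixPrintSketch {

    // Returns true if the character falls in a category that matched
    // Perl 5.6's [:print:] in the comparison described above.
    static boolean isPrint(char code) {
        switch (Character.getType(code)) {
            // These categories do NOT match [:print:].
            case Character.UNASSIGNED:
            case Character.CONTROL:
            case Character.FORMAT:
            case Character.SURROGATE:
            case Character.PRIVATE_USE:
                return false;
            default:
                // Every other category matches.
                return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("'A'    -> " + isPrint('A'));      // a letter
        System.out.println("' '    -> " + isPrint(' '));      // U+0020, a space separator
        System.out.println("tab    -> " + isPrint('\t'));     // a control character
        System.out.println("U+E000 -> " + isPrint('\uE000')); // a private-use character
    }
}
```

Returning directly from each branch avoids relying on fall-through between cases, which is the part that makes the patch a little tricky.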
[:graph:]
[:graph:] matches the same characters as [:print:], except U+0020.
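
A minimal sketch of that relationship, assuming the [:print:] rule given above (the names PosixGraphSketch, isPrint and isGraph are hypothetical and only for illustration):

```java
// Illustration only: [:graph:] expressed as "[:print:] minus U+0020".
// PosixGraphSketch, isPrint and isGraph are hypothetical names, not part of ORO.
final class PosixGraphSketch {

    // The [:print:] rule: everything except these five categories.
    static boolean isPrint(char code) {
        switch (Character.getType(code)) {
            case Character.UNASSIGNED:
            case Character.CONTROL:
            case Character.FORMAT:
            case Character.SURROGATE:
            case Character.PRIVATE_USE:
                return false;
            default:
                return true;
        }
    }

    // The [:graph:] rule: the same set, with the single character U+0020 removed.
    static boolean isGraph(char code) {
        return code != 0x0020 && isPrint(code);
    }

    public static void main(String[] args) {
        System.out.println("'A' -> " + isGraph('A'));
        System.out.println("' ' -> " + isGraph(' '));
    }
}
```

Note that under this rule only U+0020 is excluded, so other space separators such as U+00A0 still count as [:graph:].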
[:punct:]
- The following Unicode categories match Perl 5.6's [:punct:]:
Character.DASH_PUNCTUATION
Character.START_PUNCTUATION
Character.END_PUNCTUATION
Character.CONNECTOR_PUNCTUATION
- Most Character.OTHER_PUNCTUATION characters match,
but the following characters do NOT:
U+0374, U+0375, U+0E2F, U+0EAF, U+3006
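
A minimal sketch of the [:punct:] rule above, with the five exceptions listed explicitly (the names PosixPunctSketch and isPunct are hypothetical; this is an illustration, not the ORO implementation):

```java
// Illustration only: the [:punct:] rule described above.
// PosixPunctSketch and isPunct are hypothetical names, not part of ORO.
final class PosixPunctSketch {

    static boolean isPunct(char code) {
        switch (Character.getType(code)) {
            // These categories match [:punct:] as a whole.
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.CONNECTOR_PUNCTUATION:
                return true;
            case Character.OTHER_PUNCTUATION:
                switch (code) {
                    // Listed above as NOT matching Perl 5.6's [:punct:].
                    case 0x0374:
                    case 0x0375:
                    case 0x0E2F:
                    case 0x0EAF:
                    case 0x3006:
                        return false;
                    default:
                        // All other OTHER_PUNCTUATION characters match.
                        return true;
                }
            default:
                // No other category matches.
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("'-'    -> " + isPunct('-'));
        System.out.println("'!'    -> " + isPunct('!'));
        System.out.println("'a'    -> " + isPunct('a'));
        System.out.println("U+3006 -> " + isPunct('\u3006'));
    }
}
```

The five excluded code points are rejected whichever category the running JVM's Unicode tables assign them to, since only the punctuation branches ever return true.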
I have attached a patch that implements the results above. However, it makes
the code a little tricky, and users may not need this level of compatibility.
Is it nitpicking? ;)
Regards.
---------------------
Takashi Okamoto
--- Perl5Matcher.java.orig Sat Mar 31 23:01:12 2001
+++ Perl5Matcher.java Sun Apr 1 11:33:55 2001
@@ -682,25 +682,58 @@
if(Character.isUpperCase(code)) return isANYOF;
break;
case OpCode._PRINT:
- if(Character.isSpaceChar(code)) return isANYOF;
- // Fall through to check if the character is alphanumeric,
- // or a punctuation mark. Printable characters are either
- // alphanumeric, punctuation marks, or spaces.
+ switch( Character.getType(code) ) {
+ // Following unicode blocks do NOT match [:print:].
+ case Character.UNASSIGNED:
+ case Character.CONTROL:
+ case Character.FORMAT:
+ case Character.SURROGATE:
+ case Character.PRIVATE_USE:
+ break;
+ default:
+ // Others match.
+ return isANYOF;
+ }
case OpCode._GRAPH:
- if(Character.isLetterOrDigit(code)) return isANYOF;
- // Fall through to check if the character is a punctuation mark.
- // Graph characters are either alphanumeric or punctuation.
+ switch ( Character.getType(code) ) {
+ // Following unicode blocks do NOT match [:graph:].
+ case Character.UNASSIGNED:
+ case Character.CONTROL:
+ case Character.FORMAT:
+ case Character.SURROGATE:
+ case Character.PRIVATE_USE:
+ break;
+ default:
+ // Others match except U+0020.
+ if ( code != 0x0020 )
+ return isANYOF;
+ break;
+ }
case OpCode._PUNCT:
switch ( Character.getType(code) ) {
+ // Following unicode blocks match [:punct:].
case Character.DASH_PUNCTUATION:
case Character.START_PUNCTUATION:
case Character.END_PUNCTUATION:
case Character.CONNECTOR_PUNCTUATION:
- case Character.OTHER_PUNCTUATION:
return isANYOF;
- default:
- break;
+ case Character.OTHER_PUNCTUATION:
+ switch ( code ) {
+ // following OTHER_PUNCTUATION characters don't match
+ // Perl5.6's [:punct:]
+ case 0x0374:
+ case 0x0375:
+ case 0x0e2f:
+ case 0x0eaf:
+ case 0x3006:
+ break;
+ default:
+ // other OTHER_PUNCTUATION characters match.
+ return isANYOF;
}
+ default:
+ break;
+ }
break;
case OpCode._XDIGIT:
if( (code >= '0' && code <= '9') ||