Posted to commits@harmony.apache.org by "Anton Ivanov (JIRA)" <ji...@apache.org> on 2006/07/20 17:11:19 UTC

[jira] Updated: (HARMONY-933) java.util.regex.Pattern doesn't support canonical equivalence

     [ http://issues.apache.org/jira/browse/HARMONY-933?page=all ]

Anton Ivanov updated HARMONY-933:
---------------------------------

    Attachment: normalization_files.zip
                patch_src.diff
                patch_tests.diff

Here is an implementation of Unicode normalization for the java.util.regex package.

Two patch files:

1. patch_src.diff
Added a normalize() method to the Lexer class for pattern normalization, together
with auxiliary methods for canonical ordering, algorithmic Hangul decomposition, and
canonical decomposition based on mappings stored in a dedicated hash table.
Added a processDecomposedChar() method to the Pattern class for collecting decomposed characters.

2. patch_tests.diff
Added unit tests for Unicode normalization.
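
The canonical ordering step mentioned under patch_src.diff can be illustrated with a
minimal sketch. This is not the code from the patch: the class name, the method names
and the three-entry table are hypothetical stand-ins, and the real implementation is
described as reading combining classes from a generated hashtable. The idea is a stable
sort of each run of combining marks by combining class, never moving a mark across a
character of class 0.

```java
import java.util.HashMap;
import java.util.Map;

class CanonicalOrderSketch {
    // Hypothetical miniature table; values are the canonical combining
    // classes listed for these characters in UnicodeData.txt.
    private static final Map<Integer, Integer> CLASSES = new HashMap<>();
    static {
        CLASSES.put(0x0301, 230); // COMBINING ACUTE ACCENT
        CLASSES.put(0x0323, 220); // COMBINING DOT BELOW
        CLASSES.put(0x0327, 202); // COMBINING CEDILLA
    }

    // Characters missing from the table are treated as starters (class 0).
    static int combiningClass(int codePoint) {
        Integer c = CLASSES.get(codePoint);
        return c == null ? 0 : c;
    }

    // Stable insertion sort over code points. A mark moves left only past
    // marks with a strictly higher class, so it never crosses a starter and
    // marks of equal class keep their relative order.
    static int[] canonicalOrder(int[] input) {
        int[] cps = input.clone();
        for (int i = 1; i < cps.length; i++) {
            int cp = cps[i];
            int cc = combiningClass(cp);
            if (cc == 0) {
                continue; // starters stay where they are
            }
            int j = i - 1;
            while (j >= 0 && combiningClass(cps[j]) > cc) {
                cps[j + 1] = cps[j];
                j--;
            }
            cps[j + 1] = cp;
        }
        return cps;
    }

    public static void main(String[] args) {
        // 'a' + acute (230) + dot below (220) is reordered to 'a' + dot below + acute.
        for (int cp : canonicalOrder(new int[] { 'a', 0x0301, 0x0323 })) {
            System.out.printf("U+%04X ", cp);
        }
        System.out.println();
    }
}
```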

The archive normalization_files.zip contains the following classes:

3. CanClasses.java
Provides a hashtable of canonical combining classes. Generated from
http://www.unicode.org/Public/4.0-Update/UnicodeData-4.0.0.txt.

4. HashDecompositions.java
Provides a hashtable of canonical decomposition mappings. Generated from
http://www.unicode.org/Public/4.0-Update/UnicodeData-4.0.0.txt.

5. SingleDecompositions.java
Provides a hashtable with information about characters that have
a decomposition and canonical combining class 0. Generated from
http://www.unicode.org/Public/4.0-Update/UnicodeData-4.0.0.txt.

6. IntHash.java 
Hashtable implementation for int values.

7. IntArrHash.java
Hashtable implementation for int arrays.

8. DecomposedCharSet.java
Represents the canonical decomposition of a Unicode character. It is used when
the CANON_EQ flag of the Pattern class is specified.

9. CIDecomposedCharSet.java
Represents the case-insensitive canonical decomposition of a
Unicode character. It is used when the CANON_EQ flag of the Pattern class
is specified. It is currently a stub.

10. UCIDecomposedCharSet.java
Represents the Unicode case-insensitive canonical decomposition of a
Unicode character. It is used when the CANON_EQ flag of the Pattern class
is specified. It is currently a stub.

11. HangulDecomposedCharSet.java
Represents the canonical decomposition of a Hangul syllable. It is used when
the CANON_EQ flag of the Pattern class is specified.
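
For readers unfamiliar with the algorithmic Hangul decomposition referred to above,
here is a minimal sketch of the arithmetic given in the Unicode Standard. The class and
method names are hypothetical and this is not the code from patch_src.diff or
HangulDecomposedCharSet.java; only the constants and the formula come from the standard.

```java
class HangulDecompositionSketch {
    // Constants from the Unicode Standard's Hangul syllable decomposition.
    static final int S_BASE = 0xAC00; // first precomposed syllable
    static final int L_BASE = 0x1100; // first leading consonant
    static final int V_BASE = 0x1161; // first vowel
    static final int T_BASE = 0x11A7; // one before the first trailing consonant
    static final int V_COUNT = 21;
    static final int T_COUNT = 28;
    static final int N_COUNT = V_COUNT * T_COUNT; // 588
    static final int S_COUNT = 19 * N_COUNT;      // 11172 syllables

    // Returns the full canonical decomposition of a precomposed syllable,
    // or the code point itself if it is not a precomposed Hangul syllable.
    static int[] decompose(int codePoint) {
        int sIndex = codePoint - S_BASE;
        if (sIndex < 0 || sIndex >= S_COUNT) {
            return new int[] { codePoint };
        }
        int l = L_BASE + sIndex / N_COUNT;
        int v = V_BASE + (sIndex % N_COUNT) / T_COUNT;
        int t = T_BASE + sIndex % T_COUNT;
        // t == T_BASE means the syllable has no trailing consonant.
        return t == T_BASE ? new int[] { l, v } : new int[] { l, v, t };
    }

    public static void main(String[] args) {
        // U+D55C decomposes to U+1112 U+1161 U+11AB.
        for (int cp : decompose(0xD55C)) {
            System.out.printf("U+%04X ", cp);
        }
        System.out.println();
    }
}
```

Because the mapping is pure arithmetic, Hangul syllables need no entries in a
decomposition table.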

Note that some of the added unit tests throw exceptions or fail when run against the
reference implementation. The results they expect are nevertheless correct according
to the Unicode specification.
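
For context, the simplest case of what CANON_EQ is meant to enable is the one given in
the java.util.regex.Pattern Javadoc: with the flag set, the pattern "a\u030A" is
documented to match the string "\u00E5". The sketch below is illustrative only and is
not taken from patch_tests.diff; the class name and the helper method are hypothetical.

```java
import java.util.regex.Pattern;

class CanonEqSketch {
    // Compiles the regex with the given flags and tests a full match.
    static boolean matches(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String decomposed = "a\u030A"; // 'a' followed by COMBINING RING ABOVE
        String precomposed = "\u00E5"; // LATIN SMALL LETTER A WITH RING ABOVE

        // With CANON_EQ, canonically equivalent sequences should match.
        System.out.println(matches(decomposed, precomposed, Pattern.CANON_EQ));
        // Without the flag the two strings are simply different char sequences.
        System.out.println(matches(decomposed, precomposed, 0));
    }
}
```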

> java.util.regex.Pattern doesn't support canonical equivalence
> -------------------------------------------------------------
>
>                 Key: HARMONY-933
>                 URL: http://issues.apache.org/jira/browse/HARMONY-933
>             Project: Harmony
>          Issue Type: Bug
>          Components: Classlib
>            Reporter: Anton Ivanov
>            Priority: Minor
>         Attachments: normalization_files.zip, patch_src.diff, patch_tests.diff
>
>
> Canonical equivalence support is not implemented in java.util.regex. Flag Pattern.CANON_EQ is ignored.
