Posted to commits@harmony.apache.org by "Richard Liang (JIRA)" <ji...@apache.org> on 2006/06/28 10:16:29 UTC

[jira] Created: (HARMONY-688) java.util.regex.Matcher does not support Unicode supplementary characters

java.util.regex.Matcher does not support Unicode supplementary characters
-------------------------------------------------------------------------

         Key: HARMONY-688
         URL: http://issues.apache.org/jira/browse/HARMONY-688
     Project: Harmony
        Type: Bug

  Components: Classlib  
    Reporter: Richard Liang


Hello Nikolay,

The following test case passes on the RI but fails on Harmony.  Would you please have a look at this issue? Thanks a lot.

    public void test_matcher() {
        Pattern p = Pattern.compile("\\p{javaLowerCase}");
        Matcher matcher = p.matcher("\uD801\uDC28");
        assertTrue(matcher.find());
    }

Best regards,
Richard
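
For reference, the input string in this test is a single supplementary code point written as a surrogate pair. The sketch below is not part of the original report: the class name SupplementaryLowerCaseDemo is made up for illustration, and it uses only standard java.lang.Character and java.util.regex calls to show what the test expects.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SupplementaryLowerCaseDemo {
    // The test string from the report: one code point (U+10428) stored as two chars.
    static final String INPUT = "\uD801\uDC28";

    /** Combines the surrogate pair at the start of INPUT into one code point. */
    static int codePoint() {
        return Character.toCodePoint(INPUT.charAt(0), INPUT.charAt(1));
    }

    /** Runs the same pattern as the reported test case. */
    static boolean findsLowerCase() {
        Pattern p = Pattern.compile("\\p{javaLowerCase}");
        Matcher matcher = p.matcher(INPUT);
        return matcher.find();
    }

    public static void main(String[] args) {
        System.out.println("chars:       " + INPUT.length());
        System.out.println("code points: " + INPUT.codePointCount(0, INPUT.length()));
        System.out.println("code point:  U+" + Integer.toHexString(codePoint()).toUpperCase());
        System.out.println("lower case:  " + Character.isLowerCase(codePoint()));
        System.out.println("find():      " + findsLowerCase());
    }
}
```

A matcher that looks at each char separately sees two surrogates, neither of which is a lower-case letter on its own, which is consistent with the failure described above.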

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Updated: (HARMONY-688) java.util.regex.Matcher does not support Unicode supplementary characters

Posted by "Anton Ivanov (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HARMONY-688?page=all ]

Anton Ivanov updated HARMONY-688:
---------------------------------

    Attachment: patch_src.txt
                patch_tests.txt

This patch adds support for Unicode supplementary characters to the java.util.regex package.
List of changes:

patch_src.diff 

changed files:

-ReluctantQuantifierSet.java
-CompositeQuantifierSet.java
    Small changes to index processing, because a LeafSet can now contain 2 chars for one codepoint.
-EmptySet.java
    Methods find() and findBack() are implemented to override the default find() and findBack() implementations so that
    an empty string is never found in the middle of a surrogate pair.
-Lexer.java
    Removed the calls to Character's methods made through Lexer stub methods; Character's methods are now called directly.
    The stubs themselves were removed too.
    Added method nextCodePoint() to read a supplementary codepoint from an input string as a single value rather than as a pair of chars.
    Added methods to determine whether a given value is in the high surrogate or low surrogate range.
-SequenceSet.java
    Added support for new classes to method first().
-DotQuantifierSet.java
    We can build DotQuantifierSet over AbstractSets now since DotSet is a subclass of JointSet now.
-DotSet.java
-DotAllSet.java
    The dot construction can now consume one char (a non-supplementary codepoint) or two chars
    (a supplementary codepoint consisting of 2 chars), so these classes are no longer LeafSets and
    we subclass them from JointSet. Because of this, we have to implement the matches() method for both classes.
-CharClass.java
    Added support for splitting a character class into two parts: one with only surrogate codepoints and one without surrogate codepoints.
-LeafQuantifierSet.java
-UnifiedQuantifierSet.java
    Small changes to index processing, because a LeafSet can now contain 2 chars for one codepoint.
-SingleDecompositions.java
    Fixing comments.
-RangeSet.java
    Added support for new classes to method first().
-Pattern.java
    Added support for compilation of constructions with surrogate codepoints into corresponding nodes.
-DotAllQuantifierSet.java
    We can build DotAllQuantifierSet over AbstractSets now since DotAllSet is a subclass of JointSet now.
-PosPlusGroupQuantifierSet.java
    The new classes are subclasses of JointSet, but they are not ordinary JointSets and have no FSet field, so
    this class is changed to handle them.
-CharSet.java
    Fixing issue with toString() call to CharSequence object.
    Added support for new classes to method first().
-UCICharSet.java
    Removing unused method getChar().
-DecomposedCharSet.java
    Removed the calls to Character's methods made through Lexer stub methods; Character's methods are now called directly.
-UCIRangeSet.java
    Added constructor.
-AltQuantifierSet.java
    Small changes to index processing, because a LeafSet can now contain 2 chars for one codepoint.
-AbstractCharClass.java
    Added support for splitting a character class into two parts: one with only surrogate codepoints and one without surrogate codepoints.
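
The Lexer and EmptySet changes in the list above rest on two small operations: reading the input one codepoint at a time, and recognizing a position that falls inside a surrogate pair. The following is only a sketch of those two ideas under stated assumptions, not the code from the patch. The class name CodePointReaderSketch and the method splitsSurrogatePair() are made up for illustration; nextCodePoint() borrows the method name from the list but its body here is a guess built on standard java.lang.Character calls.

```java
final class CodePointReaderSketch {
    private final String input;
    private int index; // index of the next unread char

    CodePointReaderSketch(String input) {
        this.input = input;
    }

    boolean hasNext() {
        return index < input.length();
    }

    /** Reads one codepoint and advances by 1 char (BMP) or 2 chars (supplementary). */
    int nextCodePoint() {
        int cp = input.codePointAt(index);
        index += Character.charCount(cp);
        return cp;
    }

    /** True if position i lies between a high surrogate and the low surrogate that follows it. */
    static boolean splitsSurrogatePair(CharSequence s, int i) {
        return i > 0 && i < s.length()
                && Character.isHighSurrogate(s.charAt(i - 1))
                && Character.isLowSurrogate(s.charAt(i));
    }
}
```

An empty-string match would then be rejected at any index for which splitsSurrogatePair() returns true.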

new files:

-  CompositeRangeSet.java       
     This class is used to split a range that contains surrogate characters into two ranges: the first consisting of these surrogate characters and the second consisting of all other characters from the parent range.
     This class represents the parent range split in this manner.
-  HighSurrogateCharSet.java
     This class represents a high surrogate character.
-  LowHighSurrogateRangeSet.java  
     This class is a range that contains only surrogate characters.
-  LowSurrogateCharSet.java   
     This class represents a low surrogate character.
-  SupplCharSet.java
     Represents a node accepting a single supplementary codepoint.
-  SupplRangeSet.java
     Represents a node accepting a single character from the given char class.
     This character can be supplementary (2 chars needed to represent it) or from 
     the Basic Multilingual Plane (1 char needed to represent it).
-  UCISupplCharSet.java  
     Represents a node accepting a single supplementary 
     codepoint in a Unicode case-insensitive manner.
-  UCISupplRangeSet.java  
     Represents a node accepting a single character from the given char class
     in a Unicode case-insensitive manner.
     This character can be supplementary (2 chars needed to represent it) or from 
     the Basic Multilingual Plane (1 char needed to represent it).
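
The idea behind a node that accepts "a single character from the given char class" where that character may take 1 or 2 chars can be sketched as follows. This is an illustration only and does not reproduce SupplRangeSet: the class SingleCodePointMatchSketch, the method consume(), and the use of IntPredicate to stand in for a char class are all assumptions.

```java
import java.util.function.IntPredicate;

final class SingleCodePointMatchSketch {
    /**
     * Returns the number of chars consumed (1 or 2) if the codepoint at index
     * is accepted by charClass, or -1 if it is not or the input is exhausted.
     */
    static int consume(CharSequence text, int index, IntPredicate charClass) {
        if (index >= text.length()) {
            return -1;
        }
        int cp = Character.codePointAt(text, index);
        return charClass.test(cp) ? Character.charCount(cp) : -1;
    }

    public static void main(String[] args) {
        IntPredicate lowerCase = Character::isLowerCase;
        System.out.println(consume("a", 0, lowerCase));            // BMP char
        System.out.println(consume("\uD801\uDC28", 0, lowerCase)); // supplementary char
    }
}
```

Because the consumed length is not known until the codepoint has been read, such a node cannot be treated as a fixed-length leaf.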

patch_tests.diff

      Added unit tests for using supplementary characters and surrogate codepoints.

Thanks,
Anton

> java.util.regex.Matcher does not support Unicode supplementary characters
> -------------------------------------------------------------------------
>
>                 Key: HARMONY-688
>                 URL: http://issues.apache.org/jira/browse/HARMONY-688
>             Project: Harmony
>          Issue Type: Bug
>          Components: Classlib
>            Reporter: Richard Liang
>         Attachments: patch_src.txt, patch_tests.txt
>
>
> Hello Nikolay,
> The following test case passes on the RI but fails on Harmony.  Would you please have a look at this issue? Thanks a lot.
>     public void test_matcher() {
>         Pattern p = Pattern.compile("\\p{javaLowerCase}");
>         Matcher matcher = p.matcher("\uD801\uDC28");
>         assertTrue(matcher.find());
>     }
> Best regards,
> Richard


[jira] Updated: (HARMONY-688) java.util.regex.Matcher does not support Unicode supplementary characters

Posted by "Anton Ivanov (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HARMONY-688?page=all ]

Anton Ivanov updated HARMONY-688:
---------------------------------

    Attachment: patch_tests_corrected.txt

There have been several commits to the regex unit tests, so patch_tests.txt is out of date. 
1) I updated the patch for unit tests. 
2) Also I removed several System.out.println() calls 
(simple debug output) from unit tests that were left by mistake. See the dev list for details.


The current patches to apply are:

patch_src_corrected.txt

patch_tests_corrected.txt

> java.util.regex.Matcher does not support Unicode supplementary characters
> -------------------------------------------------------------------------
>
>                 Key: HARMONY-688
>                 URL: http://issues.apache.org/jira/browse/HARMONY-688
>             Project: Harmony
>          Issue Type: Bug
>          Components: Classlib
>            Reporter: Richard Liang
>         Assigned To: Tim Ellison
>         Attachments: patch_src.txt, patch_src_corrected.txt, patch_tests.txt, patch_tests_corrected.txt
>
>
> Hello Nikolay,
> The following test case passes on the RI but fails on Harmony.  Would you please have a look at this issue? Thanks a lot.
>     public void test_matcher() {
>         Pattern p = Pattern.compile("\\p{javaLowerCase}");
>         Matcher matcher = p.matcher("\uD801\uDC28");
>         assertTrue(matcher.find());
>     }
> Best regards,
> Richard


[jira] Commented: (HARMONY-688) java.util.regex.Matcher does not support Unicode supplementary characters

Posted by "Anton Ivanov (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HARMONY-688?page=comments#action_12441730 ] 
            
Anton Ivanov commented on HARMONY-688:
--------------------------------------

Some tests may fail when run on the reference implementation (RI), 
even though they have to pass according to the Unicode specification. 

For example, running PatternTest on the RI can produce the following 
test failure: 

testPredefinedClassesWithSurrogatesSupplementary
junit.framework.AssertionFailedError: null
at junit.framework.Assert.fail(Assert.java:47)
at junit.framework.Assert.assertTrue(Assert.java:20)
at junit.framework.Assert.assertFalse(Assert.java:34)
at junit.framework.Assert.assertFalse(Assert.java:41)
at
org.apache.harmony.tests.java.util.regex.PatternTest.testPredefinedClassesWithSurrogatesSupplementary (PatternTest.java:1477)
 
Here we try to find a surrogate character in the codepoint \uD916\uDE27.
It is written here:
http://www.unicode.org/reports/tr18/#Supplementary_Characters
 
"Surrogate pairs (or their equivalents in other encoding forms) are to be handled internally as single code point values"
 
So we have to treat text as code points, not code units.
Here \uD916\uDE27 is one code point consisting of 
two code units (two surrogate characters), so we find nothing.
But the RI doesn't treat this codepoint as a single whole. This is an RI bug, 
and it is wrong according to the technical report.
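
The difference between the two views can be shown without the regex engine at all. The sketch below is illustrative only: the class name CodeUnitVersusCodePointDemo is made up, and it says nothing about what either regex implementation does internally. It simply counts surrogates in \uD916\uDE27 first by code unit and then by code point, using standard String and Character calls.

```java
final class CodeUnitVersusCodePointDemo {
    // One supplementary code point (U+55A27) stored as two code units.
    static final String TEXT = "\uD916\uDE27";

    /** Counts chars (code units) that are surrogates. */
    static long surrogateCodeUnits() {
        return TEXT.chars()
                .filter(c -> Character.isSurrogate((char) c))
                .count();
    }

    /** Counts code points that lie in the surrogate range U+D800..U+DFFF. */
    static long surrogateCodePoints() {
        return TEXT.codePoints()
                .filter(cp -> cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE)
                .count();
    }

    public static void main(String[] args) {
        System.out.println("surrogate code units:  " + surrogateCodeUnits());
        System.out.println("surrogate code points: " + surrogateCodePoints());
    }
}
```

Viewed as code units the text contains two surrogates; viewed as code points it contains none, which is the behaviour the test expects.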

This issue is a good candidate to be marked as a non-bug difference.

Thanks,
Anton

> java.util.regex.Matcher does not support Unicode supplementary characters
> -------------------------------------------------------------------------
>
>                 Key: HARMONY-688
>                 URL: http://issues.apache.org/jira/browse/HARMONY-688
>             Project: Harmony
>          Issue Type: Bug
>          Components: Classlib
>            Reporter: Richard Liang
>         Assigned To: Tim Ellison
>         Attachments: patch_src.txt, patch_src_corrected.txt, patch_tests.txt
>
>
> Hello Nikolay,
> The following test case passes on the RI but fails on Harmony.  Would you please have a look at this issue? Thanks a lot.
>     public void test_matcher() {
>         Pattern p = Pattern.compile("\\p{javaLowerCase}");
>         Matcher matcher = p.matcher("\uD801\uDC28");
>         assertTrue(matcher.find());
>     }
> Best regards,
> Richard


[jira] Assigned: (HARMONY-688) java.util.regex.Matcher does not support Unicode supplementary characters

Posted by "Tim Ellison (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HARMONY-688?page=all ]

Tim Ellison reassigned HARMONY-688:
-----------------------------------

    Assignee: Tim Ellison

> java.util.regex.Matcher does not support Unicode supplementary characters
> -------------------------------------------------------------------------
>
>                 Key: HARMONY-688
>                 URL: http://issues.apache.org/jira/browse/HARMONY-688
>             Project: Harmony
>          Issue Type: Bug
>          Components: Classlib
>            Reporter: Richard Liang
>         Assigned To: Tim Ellison
>         Attachments: patch_src.txt, patch_tests.txt
>
>
> Hello Nikolay,
> The following test case passes on the RI but fails on Harmony.  Would you please have a look at this issue? Thanks a lot.
>     public void test_matcher() {
>         Pattern p = Pattern.compile("\\p{javaLowerCase}");
>         Matcher matcher = p.matcher("\uD801\uDC28");
>         assertTrue(matcher.find());
>     }
> Best regards,
> Richard


[jira] Reopened: (HARMONY-688) java.util.regex.Matcher does not support Unicode supplementary characters

Posted by "Tim Ellison (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HARMONY-688?page=all ]

Tim Ellison reopened HARMONY-688:
---------------------------------

             
I've backed out this patch in r454575 since it causes failures in j.u.Scanner -- I'll raise it on the -dev list.


> java.util.regex.Matcher does not support Unicode supplementary characters
> -------------------------------------------------------------------------
>
>                 Key: HARMONY-688
>                 URL: http://issues.apache.org/jira/browse/HARMONY-688
>             Project: Harmony
>          Issue Type: Bug
>          Components: Classlib
>            Reporter: Richard Liang
>         Assigned To: Tim Ellison
>         Attachments: patch_src.txt, patch_tests.txt
>
>
> Hello Nikolay,
> The following test case passes on the RI but fails on Harmony.  Would you please have a look at this issue? Thanks a lot.
>     public void test_matcher() {
>         Pattern p = Pattern.compile("\\p{javaLowerCase}");
>         Matcher matcher = p.matcher("\uD801\uDC28");
>         assertTrue(matcher.find());
>     }
> Best regards,
> Richard


[jira] Resolved: (HARMONY-688) java.util.regex.Matcher does not support Unicode supplementary characters

Posted by "Tim Ellison (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HARMONY-688?page=all ]

Tim Ellison resolved HARMONY-688.
---------------------------------

    Resolution: Fixed

Thanks Nikolay,

Patch applied to REGEX module at repo revision r454541.

Please check that the patch was applied as you expected.


> java.util.regex.Matcher does not support Unicode supplementary characters
> -------------------------------------------------------------------------
>
>                 Key: HARMONY-688
>                 URL: http://issues.apache.org/jira/browse/HARMONY-688
>             Project: Harmony
>          Issue Type: Bug
>          Components: Classlib
>            Reporter: Richard Liang
>         Assigned To: Tim Ellison
>         Attachments: patch_src.txt, patch_tests.txt
>
>
> Hello Nikolay,
> The following test case passes on the RI but fails on Harmony.  Would you please have a look at this issue? Thanks a lot.
>     public void test_matcher() {
>         Pattern p = Pattern.compile("\\p{javaLowerCase}");
>         Matcher matcher = p.matcher("\uD801\uDC28");
>         assertTrue(matcher.find());
>     }
> Best regards,
> Richard


Re: [jira] Commented: (HARMONY-688) java.util.regex.Matcher does not support Unicode supplementary characters

Posted by Richard Liang <ri...@gmail.com>.


Nikolay Kuznetsov (JIRA) wrote:
>     [ http://issues.apache.org/jira/browse/HARMONY-688?page=comments#action_12418290 ] 
>
> Nikolay Kuznetsov commented on HARMONY-688:
> -------------------------------------------
>
> Yes, we do not support supplementary characters. The main reason is that such support breaks quantifier optimizations over character classes of fixed length (we support 1 :-)). Now I think that I can support two different types of character classes: one of fixed width 1 (2), and a second of unknown width (1 or 2; \\p{javaLowerCase}, for instance).
>
>   
Great! Now I'm eager for this function. Thanks a lot. ;-) 
> BTW, am I right that if we do not take into account unicode normalization support this problem affects only character classes and ranges behaviour? 
Yes, I think so.
> In all the other cases it's impossible to construct such a pattern which will work incorrectly, if not could you please give me an example.
>   
I'm not sure. At least, I cannot give the example. ;-)
> Thanks.
>    Nik.
>
>   
>> java.util.regex.Matcher does not support Unicode supplementary characters
>> -------------------------------------------------------------------------
>>
>>          Key: HARMONY-688
>>          URL: http://issues.apache.org/jira/browse/HARMONY-688
>>      Project: Harmony
>>         Type: Bug
>>     
>
>   
>>   Components: Classlib
>>     Reporter: Richard Liang
>>     
>
>   
>> Hello Nikolay,
>> The following test case passes on the RI but fails on Harmony.  Would you please have a look at this issue? Thanks a lot.
>>     public void test_matcher() {
>>         Pattern p = Pattern.compile("\\p{javaLowerCase}");
>>         Matcher matcher = p.matcher("\uD801\uDC28");
>>         assertTrue(matcher.find());
>>     }
>> Best regards,
>> Richard
>>     
>
>   

-- 
Richard Liang
China Software Development Lab, IBM

[jira] Commented: (HARMONY-688) java.util.regex.Matcher does not support Unicode supplementary characters

Posted by "Nikolay Kuznetsov (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HARMONY-688?page=comments#action_12418290 ] 

Nikolay Kuznetsov commented on HARMONY-688:
-------------------------------------------

Yes, we do not support supplementary characters. The main reason is that such support breaks quantifier optimizations over character classes of fixed length (we support 1 :-)). Now I think that I can support two different types of character classes: one of fixed width 1 (2), and a second of unknown width (1 or 2; \\p{javaLowerCase}, for instance).

BTW, am I right that, if we do not take Unicode normalization support into account, this problem affects only the behaviour of character classes and ranges? In all other cases it is impossible to construct a pattern which will work incorrectly; if not, could you please give me an example?
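
To illustrate why a fixed width matters for quantifiers: when every element of a character class takes exactly one char, a quantifier can backtrack by decrementing the index; when an element may take 1 or 2 chars, each step has to look at the text. The sketch below is a guess at the general shape of the problem, not Harmony's implementation. QuantifierStepSketch, consumeGreedy() and stepBack() are hypothetical names, and IntPredicate stands in for a character class.

```java
import java.util.function.IntPredicate;

final class QuantifierStepSketch {
    /** Greedily consumes code points accepted by charClass; returns the end index. */
    static int consumeGreedy(CharSequence text, int start, IntPredicate charClass) {
        int i = start;
        while (i < text.length()) {
            int cp = Character.codePointAt(text, i);
            if (!charClass.test(cp)) {
                break;
            }
            i += Character.charCount(cp); // 1 or 2 chars, not a constant
        }
        return i;
    }

    /** Backtracks by one code point from index i (i must be greater than 0). */
    static int stepBack(CharSequence text, int i) {
        return i - Character.charCount(Character.codePointBefore(text, i));
    }
}
```

With a class of fixed width 1, stepBack() would reduce to i - 1, which is the optimization that variable-width classes rule out.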

Thanks.
   Nik.

> java.util.regex.Matcher does not support Unicode supplementary characters
> -------------------------------------------------------------------------
>
>          Key: HARMONY-688
>          URL: http://issues.apache.org/jira/browse/HARMONY-688
>      Project: Harmony
>         Type: Bug

>   Components: Classlib
>     Reporter: Richard Liang

>
> Hello Nikolay,
> The following test case passes on the RI but fails on Harmony.  Would you please have a look at this issue? Thanks a lot.
>     public void test_matcher() {
>         Pattern p = Pattern.compile("\\p{javaLowerCase}");
>         Matcher matcher = p.matcher("\uD801\uDC28");
>         assertTrue(matcher.find());
>     }
> Best regards,
> Richard


[jira] Updated: (HARMONY-688) java.util.regex.Matcher does not support Unicode supplementary characters

Posted by "Anton Ivanov (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HARMONY-688?page=all ]

Anton Ivanov updated HARMONY-688:
---------------------------------

    Attachment: patch_src_corrected.txt

I corrected the patch (patch_src.txt) and attached it to the issue (patch_src_corrected.txt).
I verified that regex and luni tests pass normally with the patch applied. 

There was a bug in the newly created class SupplRangeSet.java.
There was the following code in the method matches() of SupplRangeSet.java:

...
        if (stringIndex < strLength) {            
            char high = testString.charAt(stringIndex++);
            
            if (contains(high) && 
                    next.matches(stringIndex, testString, matchResult) > 0) {
                return 1;
            }
...

But it is wrong simply to return 1, even though the comments on method matches() in AbstractSet.java read: 

 "Checks if this node matches in given position and recursively call
  next node matches on positive self match. Returns positive integer if 
  entire match succeed, negative otherwise
  return -1 if match fails or n > 0;"

In fact, method matches() does not return just any positive n > 0: n is an offset in the case of a successful
match attempt. This fact is taken into account in all the old classes of java.util.regex, but I forgot it in SupplRangeSet.java.
So I corrected method matches() of the SupplRangeSet class as follows:

...
        int offset = -1;

        if (stringIndex < strLength) {            
            char high = testString.charAt(stringIndex++);
            
            if (contains(high) && 
                    (offset = next.matches(stringIndex, testString, matchResult)) > 0) {
                return offset;
            }
...
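
The effect of the fix can be reproduced with a toy chain of nodes. This is a minimal sketch under assumptions, not Harmony code: MatchOffsetSketch, its Node interface, END, literal() and buggyLiteral() are all invented for illustration, and the convention that a terminal node returns the index at which it is reached is an assumption made for the example.

```java
final class MatchOffsetSketch {
    /** A node returns the offset reported by the rest of the chain, or -1 on failure. */
    interface Node {
        int matches(int index, CharSequence text);
    }

    /** Terminal node (assumed convention): reports the index at which it is reached. */
    static final Node END = (index, text) -> index;

    /** Accepts one char c, then delegates to next and propagates its offset. */
    static Node literal(char c, Node next) {
        return (index, text) -> {
            if (index < text.length() && text.charAt(index) == c) {
                int offset = next.matches(index + 1, text);
                if (offset > 0) {
                    return offset; // pass the real offset back up the chain
                }
            }
            return -1;
        };
    }

    /** Same node with the mistake described above: success is reported as the constant 1. */
    static Node buggyLiteral(char c, Node next) {
        return (index, text) -> {
            if (index < text.length() && text.charAt(index) == c
                    && next.matches(index + 1, text) > 0) {
                return 1; // loses the offset computed by the rest of the chain
            }
            return -1;
        };
    }
}
```

For the input "ab", a chain built from literal() reports offset 2, while the same chain built from buggyLiteral() reports 1, so any caller that relies on the returned offset gets the wrong position.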

Thanks,
Anton

> java.util.regex.Matcher does not support Unicode supplementary characters
> -------------------------------------------------------------------------
>
>                 Key: HARMONY-688
>                 URL: http://issues.apache.org/jira/browse/HARMONY-688
>             Project: Harmony
>          Issue Type: Bug
>          Components: Classlib
>            Reporter: Richard Liang
>         Assigned To: Tim Ellison
>         Attachments: patch_src.txt, patch_src_corrected.txt, patch_tests.txt
>
>
> Hello Nikolay,
> The following test case passes on the RI but fails on Harmony.  Would you please have a look at this issue? Thanks a lot.
>     public void test_matcher() {
>         Pattern p = Pattern.compile("\\p{javaLowerCase}");
>         Matcher matcher = p.matcher("\uD801\uDC28");
>         assertTrue(matcher.find());
>     }
> Best regards,
> Richard
