Posted to commits@harmony.apache.org by "Richard Liang (JIRA)" <ji...@apache.org> on 2006/06/28 10:16:29 UTC
[jira] Created: (HARMONY-688) java.util.regex.Matcher does not
support Unicode supplementary characters
java.util.regex.Matcher does not support Unicode supplementary characters
-------------------------------------------------------------------------
Key: HARMONY-688
URL: http://issues.apache.org/jira/browse/HARMONY-688
Project: Harmony
Type: Bug
Components: Classlib
Reporter: Richard Liang
Hello Nikolay,
The following test case passes on the RI but fails on Harmony. Would you please have a look at this issue? Thanks a lot.
public void test_matcher() {
    Pattern p = Pattern.compile("\\p{javaLowerCase}");
    Matcher matcher = p.matcher("\uD801\uDC28");
    assertTrue(matcher.find());
}
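To see why the input in this test should count as a single lowercase letter, it can help to inspect the string with the code point methods of java.lang.Character and String. The following is only a minimal sketch: the class name SupplementaryLowerCaseDemo and its helper methods are made up for illustration, and a Java 5 or later class library is assumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Harmony or the RI.
final class SupplementaryLowerCaseDemo {

    // The test string from the report: one supplementary code point
    // (U+10428) stored as a surrogate pair of two chars.
    static final String INPUT = "\uD801\uDC28";

    // Number of UTF-16 code units (chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Applies the same predicate that \p{javaLowerCase} refers to,
    // but to the whole code point instead of a single char.
    static boolean firstCodePointIsLowerCase(String s) {
        return Character.isLowerCase(s.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println("code units:  " + codeUnits(INPUT));
        System.out.println("code points: " + codePoints(INPUT));
        System.out.println("lowercase code point: " + firstCodePointIsLowerCase(INPUT));
        System.out.println("lowercase high surrogate alone: "
                + Character.isLowerCase(INPUT.charAt(0)));

        Matcher m = Pattern.compile("\\p{javaLowerCase}").matcher(INPUT);
        System.out.println("find(): " + m.find());
    }
}
```

A matcher that tests each char separately only ever sees the two surrogates, neither of which is lowercase on its own, which is consistent with the failure reported here.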
Best regards,
Richard
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
[jira] Updated: (HARMONY-688) java.util.regex.Matcher does not
support Unicode supplementary characters
Posted by "Anton Ivanov (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HARMONY-688?page=all ]
Anton Ivanov updated HARMONY-688:
---------------------------------
Attachment: patch_src.txt
patch_tests.txt
This patch adds support for Unicode supplementary characters to the java.util.regex package.
List of changes:
patch_src.diff
changed files:
-ReluctantQuantifierSet.java
-CompositeQuantifierSet.java
Small changes to index processing, because a LeafSet can now contain 2 chars for one code point.
-EmptySet.java
Methods find() and findBack() are implemented to override the default find() and findBack() implementations,
so that an empty string is not found in the middle of a surrogate pair.
-Lexer.java
Removed calls to Character's methods made via Lexer stub methods; Character's methods are now called directly.
Removed these stubs too.
Added method nextCodePoint() to read a supplementary code point from an input string as a single value rather than as a pair of chars.
Added methods to determine whether a given value is in the high surrogate or low surrogate range.
-SequenceSet.java
Added support for the new classes to method first().
-DotQuantifierSet.java
We can now build DotQuantifierSet over AbstractSets, since DotSet is now a subclass of JointSet.
-DotSet.java
-DotAllSet.java
The dot construct can now consume one char (a non-supplementary code point) or two chars
(a supplementary code point consisting of 2 chars), so these classes are no longer LeafSets and
we subclass them from JointSet. Because of this, we have to implement the matches() method for both classes.
-CharClass.java
Added support for splitting a character class into two parts: one with only surrogate code points and one without surrogate code points.
-LeafQuantifierSet.java
-UnifiedQuantifierSet.java
Small changes to index processing, because a LeafSet can now contain 2 chars for one code point.
-SingleDecompositions.java
Fixed comments.
-RangeSet.java
Added support for the new classes to method first().
-Pattern.java
Added support for compiling constructs with surrogate code points into the corresponding nodes.
-DotAllQuantifierSet.java
We can now build DotAllQuantifierSet over AbstractSets, since DotAllSet is now a subclass of JointSet.
-PosPlusGroupQuantifierSet.java
The new classes are subclasses of JointSet, but they are not normal JointSets and have no FSet field, so
this case is now handled.
-CharSet.java
Fixed an issue with the toString() call on a CharSequence object.
Added support for the new classes to method first().
-UCICharSet.java
Removed unused method getChar().
-DecomposedCharSet.java
Removed calls to Character's methods made via Lexer stub methods; Character's methods are now called directly.
-UCIRangeSet.java
Added constructor.
-AltQuantifierSet.java
Small changes to index processing, because a LeafSet can now contain 2 chars for one code point.
-AbstractCharClass.java
Added support for splitting a character class into two parts: one with only surrogate code points and one without surrogate code points.
new files:
- CompositeRangeSet.java
This class is used to split a range that contains surrogate characters into two ranges: the first consisting of those surrogate characters and the second consisting of all other characters from the parent range.
This class represents the parent range split in this manner.
- HighSurrogateCharSet.java
This class represents a high surrogate character.
- LowHighSurrogateRangeSet.java
This class is a range that contains only surrogate characters.
- LowSurrogateCharSet.java
This class represents a low surrogate character.
- SupplCharSet.java
Represents a node accepting a single supplementary code point.
- SupplRangeSet.java
Represents a node accepting a single character from the given char class.
This character can be supplementary (2 chars needed to represent it) or from
the Basic Multilingual Plane (1 char needed to represent it).
- UCISupplCharSet.java
Represents a node accepting a single supplementary
code point in a Unicode case-insensitive manner.
- UCISupplRangeSet.java
Represents a node accepting a single character from the given char class
in a Unicode case-insensitive manner.
This character can be supplementary (2 chars needed to represent it) or from
the Basic Multilingual Plane (1 char needed to represent it).
patch_tests.diff
Added unit tests for supplementary characters and surrogate code points.
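As a rough illustration of two ideas from the list above (reading the input one code point at a time, and refusing to stop in the middle of a surrogate pair), here is a minimal sketch. It is not the code from the patch: the class CodePointReader and the method isInsideSurrogatePair() are hypothetical names, and only the name nextCodePoint() is borrowed from the Lexer description.

```java
// Hypothetical sketch; the names are illustrative and not taken from the patch.
final class CodePointReader {
    private final String input;
    private int index; // current position, counted in chars (UTF-16 code units)

    CodePointReader(String input) {
        this.input = input;
    }

    boolean hasNext() {
        return index < input.length();
    }

    // Reads the next code point and advances by 1 char (BMP character or
    // unpaired surrogate) or by 2 chars (supplementary code point).
    int nextCodePoint() {
        int cp = input.codePointAt(index);
        index += Character.charCount(cp);
        return cp;
    }

    int index() {
        return index;
    }

    // True if position i lies between the two halves of a surrogate pair,
    // i.e. a position where an empty match should not be reported.
    static boolean isInsideSurrogatePair(CharSequence s, int i) {
        return i > 0 && i < s.length()
                && Character.isHighSurrogate(s.charAt(i - 1))
                && Character.isLowSurrogate(s.charAt(i));
    }

    public static void main(String[] args) {
        String text = "a\uD801\uDC28b";
        CodePointReader reader = new CodePointReader(text);
        while (reader.hasNext()) {
            int cp = reader.nextCodePoint();
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase()
                    + " ends at char index " + reader.index());
        }
        for (int i = 0; i <= text.length(); i++) {
            System.out.println("index " + i + " inside pair: "
                    + isInsideSurrogatePair(text, i));
        }
    }
}
```

The same char-count bookkeeping is presumably why DotSet and DotAllSet can no longer be treated as nodes of fixed length 1.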
Thanks,
Anton
[jira] Updated: (HARMONY-688) java.util.regex.Matcher does not
support Unicode supplementary characters
Posted by "Anton Ivanov (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HARMONY-688?page=all ]
Anton Ivanov updated HARMONY-688:
---------------------------------
Attachment: patch_tests_corrected.txt
There have been several commits to the regex unit tests, so patch_tests.txt is out of date.
1) I updated the patch for the unit tests.
2) I also removed several System.out.println() calls
(simple debug output) that were left in the unit tests by mistake. See the dev list for details.
The current patches to apply are:
patch_src_corrected.txt
patch_tests_corrected.txt
[jira] Commented: (HARMONY-688) java.util.regex.Matcher does not
support Unicode supplementary characters
Posted by "Anton Ivanov (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HARMONY-688?page=comments#action_12441730 ]
Anton Ivanov commented on HARMONY-688:
--------------------------------------
Some tests may fail when executed on the reference implementation,
although they should pass according to the Unicode specification.
For example, when running PatternTest on the RI you can get the following
test failure:
testPredefinedClassesWithSurrogatesSupplementary
junit.framework.AssertionFailedError: null
at junit.framework.Assert.fail(Assert.java:47)
at junit.framework.Assert.assertTrue(Assert.java:20)
at junit.framework.Assert.assertFalse(Assert.java:34)
at junit.framework.Assert.assertFalse(Assert.java:41)
at org.apache.harmony.tests.java.util.regex.PatternTest.testPredefinedClassesWithSurrogatesSupplementary(PatternTest.java:1477)
Here we try to find a surrogate character in the code point \uD916\uDE27.
The following is written here:
http://www.unicode.org/reports/tr18/#Supplementary_Characters
"Surrogate pairs (or their equivalents in other encoding forms) are to be handled internally as single code point values"
So we have to treat text as code points, not code units.
Here \uD916\uDE27 is one code point consisting of
two code units (two surrogate characters), so we find nothing.
But the RI doesn't treat this code point as a single whole; this is an RI bug
and is wrong according to the technical report.
This issue is a good candidate to be marked as a non-bug difference.
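The distinction between code units and code points can be checked directly with java.lang.Character, independently of any regex engine. The sketch below is illustrative only: the class CodeUnitVsCodePointDemo and its methods are hypothetical names and say nothing about how either implementation is written internally.

```java
// Hypothetical demo class; it only uses java.lang.Character and String.
final class CodeUnitVsCodePointDemo {

    // The text from the failing test: a single supplementary code point.
    static final String TEXT = "\uD916\uDE27";

    // Counts chars (UTF-16 code units) that are surrogates.
    static int surrogateCodeUnits(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c) || Character.isLowSurrogate(c)) {
                count++;
            }
        }
        return count;
    }

    // Counts code points that fall in the surrogate range U+D800..U+DFFF.
    // A well-formed surrogate pair decodes to a supplementary code point,
    // so it contributes nothing to this count.
    static int surrogateCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                count++;
            }
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("surrogate code units:  " + surrogateCodeUnits(TEXT));
        System.out.println("surrogate code points: " + surrogateCodePoints(TEXT));
        System.out.println("decoded code point: U+"
                + Integer.toHexString(TEXT.codePointAt(0)).toUpperCase());
    }
}
```

Viewed as code units the text contains two surrogates; viewed as code points it contains none, which is the reading the technical report asks for.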
Thanks,
Anton
[jira] Assigned: (HARMONY-688) java.util.regex.Matcher does not
support Unicode supplementary characters
Posted by "Tim Ellison (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HARMONY-688?page=all ]
Tim Ellison reassigned HARMONY-688:
-----------------------------------
Assignee: Tim Ellison
[jira] Reopened: (HARMONY-688) java.util.regex.Matcher does not
support Unicode supplementary characters
Posted by "Tim Ellison (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HARMONY-688?page=all ]
Tim Ellison reopened HARMONY-688:
---------------------------------
I've backed out this patch in r454575 since it causes failures in j.u.Scanner -- I'll raise it on the -dev list.
[jira] Resolved: (HARMONY-688) java.util.regex.Matcher does not
support Unicode supplementary characters
Posted by "Tim Ellison (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HARMONY-688?page=all ]
Tim Ellison resolved HARMONY-688.
---------------------------------
Resolution: Fixed
Thanks Nikolay,
Patch applied to REGEX module at repo revision r454541.
Please check that the patch was applied as you expected.
Re: [jira] Commented: (HARMONY-688) java.util.regex.Matcher does
not support Unicode supplementary characters
Posted by Richard Liang <ri...@gmail.com>.
Nikolay Kuznetsov (JIRA) wrote:
> [ http://issues.apache.org/jira/browse/HARMONY-688?page=comments#action_12418290 ]
>
> Nikolay Kuznetsov commented on HARMONY-688:
> -------------------------------------------
>
> Yes, we do not support supplementary characters. The main reason was that such support breaks quantifier optimizations over character classes of fixed length (we support 1 :-)). Now I think that I can support two different types of character classes: one for fixed width 1 (2), and a second for unknown width (1 or 2; \\p{javaLowerCase}, for instance).
>
>
Great! Now I'm eager for this function. Thanks a lot. ;-)
> BTW, am I right that if we do not take Unicode normalization support into account, this problem affects only the behaviour of character classes and ranges?
Yes, I think so.
> In all other cases it's impossible to construct a pattern that will work incorrectly; if not, could you please give me an example.
>
I'm not sure. At least, I cannot give an example. ;-)
> Thanks.
> Nik.
>
>
--
Richard Liang
China Software Development Lab, IBM
[jira] Commented: (HARMONY-688) java.util.regex.Matcher does not
support Unicode supplementary characters
Posted by "Nikolay Kuznetsov (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HARMONY-688?page=comments#action_12418290 ]
Nikolay Kuznetsov commented on HARMONY-688:
-------------------------------------------
Yes, we do not support supplementary characters. The main reason was that such support breaks quantifier optimizations over character classes of fixed length (we support 1 :-)). Now I think that I can support two different types of character classes: one for fixed width 1 (2), and a second for unknown width (1 or 2; \\p{javaLowerCase}, for instance).
BTW, am I right that if we do not take Unicode normalization support into account, this problem affects only the behaviour of character classes and ranges? In all other cases it's impossible to construct a pattern that will work incorrectly; if not, could you please give me an example.
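To illustrate why a fixed-width assumption stops holding once a character class can match supplementary characters, here is a small sketch. It is not Harmony's quantifier code: the class QuantifierWidthSketch and both methods are hypothetical, and the "optimization" is reduced to the assumption that k matches of a class consume exactly k chars.

```java
// Hypothetical sketch; it does not reproduce Harmony's quantifier classes.
final class QuantifierWidthSketch {

    // Greedily consumes lowercase code points starting at 'start' and
    // returns the char index just past the run.
    static int greedyLowerCaseRun(String s, int start) {
        int i = start;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (!Character.isLowerCase(cp)) {
                break;
            }
            i += Character.charCount(cp); // 1 or 2 chars per match
        }
        return i;
    }

    // End index predicted under the assumption that every match of the
    // class is exactly one char wide.
    static int fixedWidthEnd(int start, int matches) {
        return start + matches;
    }

    public static void main(String[] args) {
        // Four lowercase code points (the third is supplementary), then '!'.
        String text = "ab\uD801\uDC28c!";
        System.out.println("actual end of run:       " + greedyLowerCaseRun(text, 0));
        System.out.println("fixed-width prediction:  " + fixedWidthEnd(0, 4));
    }
}
```

When the two numbers differ, stepping back "one match" by decrementing the index by one would land inside a surrogate pair, which is presumably why a separate node type for classes of unknown width is proposed.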
Thanks.
Nik.
[jira] Updated: (HARMONY-688) java.util.regex.Matcher does not
support Unicode supplementary characters
Posted by "Anton Ivanov (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HARMONY-688?page=all ]
Anton Ivanov updated HARMONY-688:
---------------------------------
Attachment: patch_src_corrected.txt
I corrected the patch (patch_src.txt) and attached it to the issue (patch_src_corrected.txt).
I verified that regex and luni tests pass normally with the patch applied.
There was a bug in the newly created class SupplRangeSet.java.
The method matches() of SupplRangeSet.java contained the following code:
...
if (stringIndex < strLength) {
    char high = testString.charAt(stringIndex++);
    if (contains(high) &&
            next.matches(stringIndex, testString, matchResult) > 0) {
        return 1;
    }
...
But simply returning 1 is wrong, even though the comments in AbstractSet.java describe method matches() as follows:
"Checks if this node matches in given position and recursively call
next node matches on positive self match. Returns positive integer if
entire match succeed, negative otherwise
return -1 if match fails or n > 0;"
In fact, method matches() does not return just any positive n > 0: n is an offset in the case of a successful
match attempt. This fact is taken into account in all the old classes of java.util.regex, but I forgot it in SupplRangeSet.java.
So I corrected method matches() of the SupplRangeSet class as follows:
...
int offset = -1;
if (stringIndex < strLength) {
    char high = testString.charAt(stringIndex++);
    if (contains(high) &&
            (offset = next.matches(stringIndex, testString, matchResult)) > 0) {
        return offset;
    }
...
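The difference between returning 1 and returning the offset can be shown with a toy chain of nodes. This is only a sketch under simplifying assumptions: the Node interface below is hypothetical and takes two arguments, whereas the real matches() in the excerpt above also takes a matchResult; the class names Final, Literal and BuggyLiteral are invented for the example.

```java
// Hypothetical toy model of chained matcher nodes; not Harmony's AbstractSet API.
final class MatchOffsetSketch {

    // Returns the end offset of the entire match, or -1 if the match fails.
    interface Node {
        int matches(int index, CharSequence text);
    }

    // Terminal node: the whole match succeeded and ends at 'index'.
    static final class Final implements Node {
        public int matches(int index, CharSequence text) {
            return index;
        }
    }

    // Matches one literal char, then passes the result of the rest of
    // the chain back to the caller unchanged.
    static final class Literal implements Node {
        private final char ch;
        private final Node next;

        Literal(char ch, Node next) {
            this.ch = ch;
            this.next = next;
        }

        public int matches(int index, CharSequence text) {
            if (index < text.length() && text.charAt(index) == ch) {
                return next.matches(index + 1, text); // propagate the offset
            }
            return -1;
        }
    }

    // Same node with the mistake described above: a successful result
    // from the rest of the chain is collapsed to the constant 1.
    static final class BuggyLiteral implements Node {
        private final char ch;
        private final Node next;

        BuggyLiteral(char ch, Node next) {
            this.ch = ch;
            this.next = next;
        }

        public int matches(int index, CharSequence text) {
            if (index < text.length() && text.charAt(index) == ch
                    && next.matches(index + 1, text) > 0) {
                return 1; // loses the real end offset
            }
            return -1;
        }
    }

    public static void main(String[] args) {
        Node good = new Literal('a', new Literal('b', new Final()));
        Node bad = new BuggyLiteral('a', new Literal('b', new Final()));
        System.out.println("correct chain: " + good.matches(0, "ab"));
        System.out.println("buggy chain:   " + bad.matches(0, "ab"));
    }
}
```

Both chains report success, but only the first tells the caller where the match ends, which is the information the old java.util.regex classes rely on.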
Thanks,
Anton