Posted to commits@harmony.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2010/09/17 15:45:34 UTC

[jira] Created: (HARMONY-6650) Character.getType(int) inconsistent with Character.getType(char): uses different version of unicode

Character.getType(int) inconsistent with Character.getType(char): uses different version of unicode
---------------------------------------------------------------------------------------------------

                 Key: HARMONY-6650
                 URL: https://issues.apache.org/jira/browse/HARMONY-6650
             Project: Harmony
          Issue Type: Bug
          Components: Classlib
            Reporter: Robert Muir


While looking at Character, I noticed that the code for the 'int' methods looks very different from the code for the 'char' methods.
In particular, the int method defers to ICU, but the char method binary-searches its own table.
The comment for that table is:

// Unicode 3.0.1 (same as Unicode 3.0.0)
private static final char[] typeValues ....

But Unicode 3 is the wrong version for Java 5/6, which are specified against Unicode 4.0.

So I tried a character whose type changed between Unicode 3.0 and 4.0, just to see.
For example, compare these two results:

Character.getType('\u17B5') = 8 (combining mark)
Character.getType((int) '\u17B5') = 16 (format)
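
The comparison above can be reproduced with a small standalone program. This is only a sketch: the class name GetTypeCheck and the helper typesAgree are made up for illustration and are not part of Harmony. The numeric values printed depend on the Unicode version behind the runtime, so nothing here assumes a particular result; the constants are printed only to show which general categories 8 and 16 correspond to.

```java
// Hypothetical helper class for illustration; not Harmony code.
class GetTypeCheck {

    // Returns true when the char and int overloads report the same
    // general category for a BMP character.
    static boolean typesAgree(char c) {
        return Character.getType(c) == Character.getType((int) c);
    }

    public static void main(String[] args) {
        char c = '\u17B5'; // the character used in the report

        System.out.println("getType(char) = " + Character.getType(c));
        System.out.println("getType(int)  = " + Character.getType((int) c));

        // The two values quoted in the report map to these constants.
        System.out.println("COMBINING_SPACING_MARK = " + Character.COMBINING_SPACING_MARK);
        System.out.println("FORMAT                 = " + Character.FORMAT);

        System.out.println("overloads agree: " + typesAgree(c));
    }
}
```

On a runtime where both overloads share one data source, the last line should report agreement, whatever category that data assigns to the character.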


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HARMONY-6650) Character.getType(int) inconsistent with Character.getType(char): uses different version of unicode

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HARMONY-6650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913745#action_12913745 ] 

Robert Muir commented on HARMONY-6650:
--------------------------------------

OK, that makes good sense; I'll work up a patch.




[jira] Commented: (HARMONY-6650) Character.getType(int) inconsistent with Character.getType(char): uses different version of unicode

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HARMONY-6650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913382#action_12913382 ] 

Robert Muir commented on HARMONY-6650:
--------------------------------------

I don't mind working up a patch for this approach.

I have one last question, though, related to this issue, that I've been trying to figure out.

Harmony is using ICU 4.4.x (in other places too, I assume?), which means that things like
these properties come from Unicode 5.2. But if I look here:

http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp#unicode-version

The version is Unicode 4. Is this something that is not actually in the spec (but an implementation detail)?
Or is it already a compatibility issue that Harmony uses this higher version of Unicode?

If it's a problem, I certainly don't have ideas on how to address it... but it would cause
lots of problems up the stack, such as different rendering behavior.

For reference, here is a diff between 4.0 and 5.2 that shows all the differences in the UCD:
http://people.apache.org/~rmuir/unicodeDiff2.txt




[jira] Commented: (HARMONY-6650) Character.getType(int) inconsistent with Character.getType(char): uses different version of unicode

Posted by "Tim Ellison (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HARMONY-6650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913377#action_12913377 ] 

Tim Ellison commented on HARMONY-6650:
--------------------------------------

I suggest we rewrite the char methods to use ICU.  I think this came about because, when we added the new int methods, we didn't go back and fix the char versions.  Just delegate them through.  My 2c.
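
A minimal sketch of what "delegate them through" could look like. The class DelegatingCharacter is hypothetical and is not the Harmony source. To keep the example self-contained, the int overloads here call java.lang.Character as a stand-in; the assumption is that in Harmony they would be the existing ICU-backed implementations. The point is only that the char overloads keep no table of their own.

```java
// Hypothetical illustration of the delegation pattern; not Harmony code.
class DelegatingCharacter {

    // Stand-in for the ICU-backed code point lookup (assumption: the real
    // method would call into ICU rather than java.lang.Character).
    static int getType(int codePoint) {
        return Character.getType(codePoint);
    }

    // The char overload widens its argument and delegates, so both
    // overloads always answer from the same data.
    static int getType(char c) {
        return getType((int) c);
    }

    // Same stand-in assumption as getType(int).
    static boolean isLetter(int codePoint) {
        return Character.isLetter(codePoint);
    }

    static boolean isLetter(char c) {
        return isLetter((int) c);
    }

    public static void main(String[] args) {
        char c = '\u17B5';
        System.out.println(getType(c) == getType((int) c)); // same lookup path
    }
}
```

The explicit (int) cast matters inside the char overloads: without it, the call would resolve back to the char overload and recurse.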




[jira] Commented: (HARMONY-6650) Character.getType(int) inconsistent with Character.getType(char): uses different version of unicode

Posted by "Tim Ellison (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HARMONY-6650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913739#action_12913739 ] 

Tim Ellison commented on HARMONY-6650:
--------------------------------------

We made a deliberate decision to keep up with the latest ICU and Unicode versions, so we recognize that we depart from the RI in terms of compatibility here.

As you say, trying to match a particular version chosen by another implementation probably isn't a productive use of our time.  The Sun impl is moving up through the Unicode versions slowly; we're just a bit more agile than they are ;-)




[jira] Commented: (HARMONY-6650) Character.getType(int) inconsistent with Character.getType(char): uses different version of unicode

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HARMONY-6650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913362#action_12913362 ] 

Robert Muir commented on HARMONY-6650:
--------------------------------------

I am worried the issue is more widespread in Character.
For example, isLetter has inconsistencies too, which can be seen with a simple test like this:

// Every char value should get the same answer from both overloads.
for (int ch = 0; ch <= Character.MAX_VALUE; ch++) {
    assertEquals("inconsistency with isLetter(int)",
            Character.isLetter(ch),
            Character.isLetter((char) ch));
}

For most of these methods, the int-based version just calls UCharacter.
Is there a reason not to do this for the char-based methods too?

Otherwise, I think the various tables in the code need to be regenerated to be consistent.
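
To get a rough picture of how widespread the problem is, the same loop can be extended to several methods at once and made to count mismatches instead of failing on the first one. This is a sketch only: the class name CharIntConsistency and the choice of four methods are illustrative, not taken from the Harmony test suite.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical checker for illustration; not part of the Harmony tests.
class CharIntConsistency {

    // Counts, per method, how many char values get a different answer
    // from the char overload than from the int overload.
    static Map<String, Integer> countMismatches() {
        Map<String, Integer> mismatches = new LinkedHashMap<>();
        for (String name : new String[] {"getType", "isLetter", "isDigit", "isWhitespace"}) {
            mismatches.put(name, 0);
        }

        // The counter is an int so the loop can terminate after 0xFFFF.
        for (int cp = 0; cp <= Character.MAX_VALUE; cp++) {
            char c = (char) cp;
            if (Character.getType(c) != Character.getType(cp)) {
                mismatches.merge("getType", 1, Integer::sum);
            }
            if (Character.isLetter(c) != Character.isLetter(cp)) {
                mismatches.merge("isLetter", 1, Integer::sum);
            }
            if (Character.isDigit(c) != Character.isDigit(cp)) {
                mismatches.merge("isDigit", 1, Integer::sum);
            }
            if (Character.isWhitespace(c) != Character.isWhitespace(cp)) {
                mismatches.merge("isWhitespace", 1, Integer::sum);
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        countMismatches().forEach((name, count) ->
                System.out.println(name + ": " + count + " mismatches"));
    }
}
```

Any nonzero count points at a method whose char path and int path are backed by different data.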
