Posted to issues@commons.apache.org by "Thomas Neidhart (JIRA)" <ji...@apache.org> on 2013/05/16 23:37:16 UTC

[jira] [Resolved] (LANG-862) CharSequenceTranslator causes StringIndexOutOfBoundsException during translation of unicode codepoints with length > 1 character

     [ https://issues.apache.org/jira/browse/LANG-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Neidhart resolved LANG-862.
----------------------------------

       Resolution: Duplicate
    Fix Version/s: 3.2
    
> CharSequenceTranslator causes StringIndexOutOfBoundsException during translation of unicode codepoints with length > 1 character
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LANG-862
>                 URL: https://issues.apache.org/jira/browse/LANG-862
>             Project: Commons Lang
>          Issue Type: Bug
>          Components: lang.text.translate.*
>    Affects Versions: 3.1
>         Environment: OS X, Java 1.6
>            Reporter: Michael Houston
>              Labels: bug, text, unicode
>             Fix For: 3.2
>
>
> When translating a string containing Unicode characters, I encountered an index exception:
> {code}
> 	java.lang.StringIndexOutOfBoundsException: String index out of range: 50
> 	at java.lang.String.charAt(String.java:686)
> 	at java.lang.Character.codePointAt(Character.java:2335)
> 	at org.apache.commons.lang3.text.translate.CharSequenceTranslator.translate(CharSequenceTranslator.java:95)
> 	at org.apache.commons.lang3.text.translate.CharSequenceTranslator.translate(CharSequenceTranslator.java:59)
> 	at org.apache.commons.lang3.StringEscapeUtils.escapeCsv(StringEscapeUtils.java:556)
> 	...
> {code}
> The input string was from a twitter status:
> org.apache.commons.lang3.StringEscapeUtils.escapeCsv("pink & black adidas suit for this rainy weather \ud83d\udc4d");
> Both of those characters are 'Invalid' Unicode characters on their own (each is a UTF-16 surrogate code unit), so presumably there is a conversion error somewhere. However, this shouldn't cause the translator to crash.
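For what it's worth, the two chars in that input look like a UTF-16 surrogate pair encoding a single supplementary code point, rather than two independent invalid characters. A minimal JDK-only sketch to check this follows; the class name `SurrogatePairCheck` and the helper `combine` are made up for illustration and are not part of Commons Lang.

```java
class SurrogatePairCheck {
    /** Returns the code point formed by the two chars, or -1 if they are not a valid surrogate pair. */
    public static int combine(char high, char low) {
        return Character.isSurrogatePair(high, low) ? Character.toCodePoint(high, low) : -1;
    }

    public static void main(String[] args) {
        String s = "\ud83d\udc4d";
        System.out.println(s.length());                      // UTF-16 chars: prints 2
        System.out.println(s.codePointCount(0, s.length())); // code points: prints 1
        System.out.printf("U+%X%n", combine(s.charAt(0), s.charAt(1))); // prints U+1F44D
    }
}
```

So the string has two chars but only one code point, which is the mismatch that matters for any loop that mixes char indexes with code point counts.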
> At line 94, the loop that throws the exception increments the position by the size of the codepoint, which seems to grow faster than the number of characters. I don't really know how codepoints work, but it looks to me like this loop treats two different indexes as if they were the same one:
>  * pt is incrementing by one character each iteration
>  * pos is incrementing by one or more characters each iteration
>  * pos is being used to index into the character array
>  * pt is the value actually being tested in the loop test, so pos can be bigger than pt, causing an index problem at the end of the array
> My guess would be that the loop should read something like:
> {code}
>             for (int pt = 0; pt < consumed;) {
>                 int count = Character.charCount(Character.codePointAt(input, pos));
>                 pt += count;
>                 pos += count;
>             }
> {code}
> I'm not sure if that was the intention, hope it makes some sense!
> Stepping through that code with the input string " \ud83d\udc4d":
> * the input string becomes " \ud83d\udc4d\u008d" (appended 'Reverse Line Feed' - no idea why)
> * consumed == 4
> * Iterating the loop gives pt=0, pos=0 -> pt=1, pos=1 -> pt=2, pos=3 -> pt=3, pos=4 (Index exception)
> So \ud83d\udc4d seems to be a codepoint with a width of 2, which puts the index off by one after that.
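The trace above can be reproduced without the library. The sketch below is a reconstruction from the description in this report, not a copy of the Commons Lang source: `advanceAsReported` assumes the loop bumps `pt` by one per iteration while `pos` advances by the char count, and `advanceAsProposed` is the loop suggested above. The class and method names are hypothetical.

```java
class AdvanceLoopSketch {
    /** Loop as described in the report: pt counts iterations, pos advances by chars. */
    public static int advanceAsReported(CharSequence input, int pos, int consumed) {
        for (int pt = 0; pt < consumed; pt++) {
            pos += Character.charCount(Character.codePointAt(input, pos));
        }
        return pos;
    }

    /** Loop as proposed in the report: both counters advance by the char count. */
    public static int advanceAsProposed(CharSequence input, int pos, int consumed) {
        for (int pt = 0; pt < consumed;) {
            int count = Character.charCount(Character.codePointAt(input, pos));
            pt += count;
            pos += count;
        }
        return pos;
    }

    public static void main(String[] args) {
        // Four chars, three code points: space, one surrogate pair, one trailing char.
        String input = " \ud83d\udc4d\u008d";
        System.out.println(advanceAsProposed(input, 0, 4)); // prints 4
        try {
            advanceAsReported(input, 0, 4);
        } catch (IndexOutOfBoundsException e) {
            // The fourth iteration reads index 4 of a 4-char string.
            System.out.println("reported loop failed: " + e.getMessage());
        }
    }
}
```

Note that the proposed loop treats `consumed` as a count of chars. If `consumed` is instead meant to be a count of code points, the mismatch would have to be fixed on the side that reports `consumed`; the report does not settle which reading is intended.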
> Anyway, hope that helps,
> Regards,
> Mike.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira