Posted to issues@commons.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/05/19 06:23:00 UTC

[jira] [Work logged] (TEXT-209) LookupTranslator returns count of chars consumed, not of codepoints consumed

     [ https://issues.apache.org/jira/browse/TEXT-209?focusedWorklogId=772286&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-772286 ]

ASF GitHub Bot logged work on TEXT-209:
---------------------------------------

                Author: ASF GitHub Bot
            Created on: 19/May/22 06:22
            Start Date: 19/May/22 06:22
    Worklog Time Spent: 10m 
      Work Description: fourAjeff opened a new pull request, #324:
URL: https://github.com/apache/commons-text/pull/324

   Hello,
   This is a quick bugfix for LookupTranslator. The bug is that it returns the count of chars consumed, not the count of codepoints consumed.
   A full description of the problem is found in the ticket: https://issues.apache.org/jira/browse/TEXT-209




Issue Time Tracking
-------------------

            Worklog Id:     (was: 772286)
    Remaining Estimate: 0h
            Time Spent: 10m

> LookupTranslator returns count of chars consumed, not of codepoints consumed
> ----------------------------------------------------------------------------
>
>                 Key: TEXT-209
>                 URL: https://issues.apache.org/jira/browse/TEXT-209
>             Project: Commons Text
>          Issue Type: Bug
>    Affects Versions: 1.9
>            Reporter: K P
>            Priority: Minor
>              Labels: Surrogates, Unicode
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The contract of the abstract method {{translate(CharSequence input, int index, Writer out)}} in the class CharSequenceTranslator, and therefore also in its subclass LookupTranslator, is to return the "_int count of codepoints consumed_".
> Cf. their javadoc.
>  
> However, LookupTranslator returns the number of chars.
> This can be seen in its source, in its implementation of the abstract method, where it returns "i", which is the length _in chars_ of the longest matching substring.
> Test to reproduce:
> Define a mapping where a String with 1 supplementary character is mapped to 1 (basic) char.
> {code:java}
> /* Key: string with Mathematical double-struck capital A (U+1D538) */
> String symbol = new StringBuilder().appendCodePoint(0x1D538).toString();
> /* Map U+1D538 to "A" */
> Map<CharSequence, CharSequence> map = new HashMap<>();
> map.put(symbol, "A");
> LookupTranslator translator = new LookupTranslator(map);
> String translated = translator.translate(symbol + "=A");
>
> /* Fails: instead of "A=A", we get "AA". */
> assertEquals("A=A", translated);
> {code}
> So the supplementary character is mapped correctly, but the LookupTranslator then erroneously +_swallows_ the following "=" character+.
> That is because its translate method returns the count of matched _chars_ (i.e. 2: the high and low surrogate code units that form the surrogate pair) instead of the count of matched _codepoints_ (i.e. 1, which is what the javadoc claims to return).
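
The difference between the two counts can be illustrated with a minimal, self-contained sketch. It does not use Commons Text and is not the code from the pull request: the class CodePointCountSketch and its methods longestMatch, translate and translateAll are hypothetical names made up for illustration. The sketch assumes that the caller advances one codepoint at a time for each unit of the returned count, as the documented contract implies, and converts the matched key's length in chars to a codepoint count with the standard Character.codePointCount.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a lookup translator; not the Commons Text implementation.
class CodePointCountSketch {

    // Returns the longest key that matches input at the given char index, or null.
    static String longestMatch(Map<String, String> map, CharSequence input, int index) {
        String best = null;
        for (String key : map.keySet()) {
            int end = index + key.length();
            if (end <= input.length()
                    && key.contentEquals(input.subSequence(index, end))
                    && (best == null || key.length() > best.length())) {
                best = key;
            }
        }
        return best;
    }

    // Writes the replacement and returns the number of CODEPOINTS consumed (0 if no match).
    static int translate(Map<String, String> map, CharSequence input, int index, StringBuilder out) {
        String key = longestMatch(map, input, index);
        if (key == null) {
            return 0;
        }
        out.append(map.get(key));
        // key.length() is a count of chars; convert it to a count of codepoints.
        return Character.codePointCount(key, 0, key.length());
    }

    // Assumed driver loop: advances by whole codepoints, once per consumed codepoint.
    static String translateAll(Map<String, String> map, String input) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < input.length()) {
            int consumed = translate(map, input, pos, out);
            if (consumed == 0) {
                // No match: copy one codepoint through unchanged.
                int cp = input.codePointAt(pos);
                out.appendCodePoint(cp);
                pos += Character.charCount(cp);
            } else {
                for (int i = 0; i < consumed; i++) {
                    pos += Character.charCount(input.codePointAt(pos));
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String symbol = new StringBuilder().appendCodePoint(0x1D538).toString();
        Map<String, String> map = new HashMap<>();
        map.put(symbol, "A");
        System.out.println(symbol.length());                           // 2 chars
        System.out.println(symbol.codePointCount(0, symbol.length())); // 1 codepoint
        System.out.println(translateAll(map, symbol + "=A"));          // A=A
    }
}
```

If translate returned key.length() instead, the driver above would skip two codepoints for the surrogate pair (the pair itself and the "=" after it), which reproduces the "AA" result described in the issue.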



--
This message was sent by Atlassian Jira
(v8.20.7#820007)