Posted to issues@commons.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/05/19 06:23:00 UTC
[jira] [Work logged] (TEXT-209) LookupTranslator returns count of chars consumed, not of codepoints consumed
[ https://issues.apache.org/jira/browse/TEXT-209?focusedWorklogId=772286&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-772286 ]
ASF GitHub Bot logged work on TEXT-209:
---------------------------------------
Author: ASF GitHub Bot
Created on: 19/May/22 06:22
Start Date: 19/May/22 06:22
Worklog Time Spent: 10m
Work Description: fourAjeff opened a new pull request, #324:
URL: https://github.com/apache/commons-text/pull/324
Hello,
This is a quick bugfix for LookupTranslator, which returns the count of chars consumed instead of the count of codepoints consumed.
A full description of the problem is found in the ticket: https://issues.apache.org/jira/browse/TEXT-209
Issue Time Tracking
-------------------
Worklog Id: (was: 772286)
Remaining Estimate: 0h
Time Spent: 10m
> LookupTranslator returns count of chars consumed, not of codepoints consumed
> ----------------------------------------------------------------------------
>
> Key: TEXT-209
> URL: https://issues.apache.org/jira/browse/TEXT-209
> Project: Commons Text
> Issue Type: Bug
> Affects Versions: 1.9
> Reporter: K P
> Priority: Minor
> Labels: Surrogates, Unicode
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The contract of the abstract method {{translate(CharSequence input, int index, Writer out)}} in the class CharSequenceTranslator, and therefore also in its subclass LookupTranslator, is to return the "_int count of codepoints consumed_".
> Cf. their javadoc.
>
> However, LookupTranslator returns the number of chars.
> This can be seen in its source, in its implementation of the abstract method, where it returns "i", which is the length _in chars_ of the longest matching substring.
> Test to reproduce:
> Define a mapping where a String with 1 supplementary character is mapped to 1 (basic) char.
> {code:java}
> /* Key: string with Mathematical double-struck capital A (U+1D538) */
> String symbol = new StringBuilder().appendCodePoint(0x1D538).toString();
> /* Map U+1D538 to "A" */
> Map<CharSequence, CharSequence> map = new HashMap<>();
> map.put(symbol, "A");
> LookupTranslator translator = new LookupTranslator(map);
> String translated = translator.translate(symbol + "=A");
>
> /* Fails: instead of "A=A", we get "AA". */
> assertEquals("A=A", translated);
> {code}
> So the supplementary character is mapped as expected, but the LookupTranslator then erroneously +_swallows_ the following "=" character+.
> That is because its translate method returns the count of matched _chars_ (2: the high and low surrogate code units that form the surrogate pair) instead of the count of matched _codepoints_ (1, which is what the javadoc promises). The caller treats the returned value as a number of codepoints to skip, so it also skips past the "=".
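
The effect described above can be reproduced without Commons Text on the classpath. The following is a minimal, self-contained sketch and not the library's code: the class CodePointCountSketch, its methods, and the returnCodePoints flag are hypothetical names made up for illustration, and the driving loop is only assumed to advance by the returned number of codepoints, as the javadoc contract implies. Converting the matched length with Character.codePointCount is one plausible way to honour that contract; it is not necessarily what pull request #324 does.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the translator machinery; not Commons Text code.
class CodePointCountSketch {

    /**
     * Finds the longest key that matches the input at the given index,
     * appends its mapped value to out, and returns the match length
     * in chars (UTF-16 code units), or 0 when nothing matches.
     */
    static int matchLengthInChars(Map<String, String> map, CharSequence input,
                                  int index, StringBuilder out) {
        int best = 0;
        String bestValue = null;
        for (Map.Entry<String, String> e : map.entrySet()) {
            String key = e.getKey();
            int end = index + key.length();
            if (key.length() > best && end <= input.length()
                    && key.contentEquals(input.subSequence(index, end))) {
                best = key.length();
                bestValue = e.getValue();
            }
        }
        if (bestValue != null) {
            out.append(bestValue);
        }
        return best;
    }

    /**
     * Assumed driving loop: it interprets the value reported by the lookup
     * step as a number of codepoints and advances by that many.
     * returnCodePoints = false reproduces the reported behaviour,
     * returnCodePoints = true converts the char count to codepoints first.
     */
    static String translate(Map<String, String> map, CharSequence input,
                            boolean returnCodePoints) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < input.length()) {
            int chars = matchLengthInChars(map, input, pos, out);
            if (chars == 0) {
                // No match: copy one whole codepoint through unchanged.
                int cp = Character.codePointAt(input, pos);
                out.appendCodePoint(cp);
                pos += Character.charCount(cp);
                continue;
            }
            int consumed = returnCodePoints
                    ? Character.codePointCount(input, pos, pos + chars)
                    : chars;
            // Bounds guard added for this sketch so that over-counting
            // cannot run past the end of the input.
            for (int pt = 0; pt < consumed && pos < input.length(); pt++) {
                pos += Character.charCount(Character.codePointAt(input, pos));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1D538 is one codepoint but two chars (a surrogate pair).
        String symbol = new StringBuilder().appendCodePoint(0x1D538).toString();
        Map<String, String> map = new HashMap<>();
        map.put(symbol, "A");

        System.out.println(symbol.length());                           // 2
        System.out.println(symbol.codePointCount(0, symbol.length())); // 1
        System.out.println(translate(map, symbol + "=A", false));      // AA
        System.out.println(translate(map, symbol + "=A", true));       // A=A
    }
}
```

With the char count reported, the loop steps over U+1D538 and then over "=", which gives "AA". With the codepoint count reported, only U+1D538 is skipped and the result is "A=A".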
--
This message was sent by Atlassian Jira
(v8.20.7#820007)