Posted to issues@commons.apache.org by "K P (Jira)" <ji...@apache.org> on 2021/06/22 12:49:00 UTC

[jira] [Created] (TEXT-209) LookupTranslator returns count of chars consumed, not of codepoints consumed

K P created TEXT-209:
------------------------

             Summary: LookupTranslator returns count of chars consumed, not of codepoints consumed
                 Key: TEXT-209
                 URL: https://issues.apache.org/jira/browse/TEXT-209
             Project: Commons Text
          Issue Type: Bug
    Affects Versions: 1.9
            Reporter: K P


The contract of the abstract method {{translate(CharSequence input, int index, Writer out)}} in the class CharSequenceTranslator, and therefore also in its subclass LookupTranslator, is to return the "_int count of codepoints consumed_".

Cf. their javadoc.

 

However, LookupTranslator returns the number of chars.

This can be seen in its source, in its implementation of the abstract method, where it returns "i", which is the length _in chars_ of the longest matching substring.
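To illustrate the difference between the two counts, here is a minimal sketch using only the JDK. The class name CodePointCountSketch and the helper codePointsConsumed are hypothetical and not part of Commons Text; the helper shows one possible way a match length in chars could be converted into the codepoint count that the javadoc promises.

```java
/* Hypothetical helper, not part of Commons Text. */
class CodePointCountSketch {

    /* Converts a match length given in chars into a count of codepoints. */
    static int codePointsConsumed(CharSequence input, int index, int matchedChars) {
        return Character.codePointCount(input, index, index + matchedChars);
    }

    public static void main(String[] args) {
        /* Mathematical double-struck capital A (U+1D538), a supplementary character */
        String symbol = new StringBuilder().appendCodePoint(0x1D538).toString();
        String input = symbol + "=A";

        /* Length in chars: the surrogate pair counts as 2 */
        System.out.println(symbol.length());
        /* Length in codepoints: the same match counts as 1 */
        System.out.println(codePointsConsumed(input, 0, symbol.length()));
    }
}
```

For input without supplementary characters the two counts are identical, which may explain why the discrepancy is easy to miss.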

Test to reproduce:

Define a mapping in which a String consisting of one supplementary character (a single codepoint, but two chars) is mapped to one basic (BMP) char.
{code:java}
/* Key: string with Mathematical double-struck capital A (U+1D538) */
String symbol = new StringBuilder().appendCodePoint(0x1D538).toString();

/* Map U+1D538 to "A" */
Map<CharSequence, CharSequence> map = new HashMap<>();
map.put(symbol, "A");

LookupTranslator translator = new LookupTranslator(map);
String translated = translator.translate(symbol + "=A");
		
/* Fails: instead of "A=A", we get "AA". */
assertEquals("A=A", translated);

{code}
So the supplementary character is translated correctly, but LookupTranslator then erroneously +_swallows_ the following "=" character+.
That is because its translate method returns the count of matched _chars_ (i.e. 2, the high and low surrogate code units that form the surrogate pair) instead of the count of matched _codepoints_ (i.e. 1, which is what the javadoc claims it returns). The caller treats the returned value as a codepoint count and therefore advances past two codepoints: the symbol itself and the "=" after it.
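The effect can be reproduced without Commons Text using a simplified stand-in for the translation loop. This is only a sketch: the class SurrogateSkipSketch, its translate method and the returnCodePoints flag are hypothetical, and the loop is an assumption about how a caller that trusts the documented contract would advance, not a copy of the library source.

```java
import java.util.Map;

/* Hypothetical stand-in for a lookup-based translator and its driver loop. */
class SurrogateSkipSketch {

    static String translate(String input, Map<String, String> map, boolean returnCodePoints) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < input.length()) {
            int consumed = 0;
            for (Map.Entry<String, String> e : map.entrySet()) {
                if (input.startsWith(e.getKey(), pos)) {
                    out.append(e.getValue());
                    int chars = e.getKey().length();
                    /* Either report codepoints (the documented contract) or chars (the reported bug) */
                    consumed = returnCodePoints
                            ? Character.codePointCount(input, pos, pos + chars)
                            : chars;
                    break;
                }
            }
            if (consumed == 0) {
                /* No match: copy one codepoint through unchanged */
                int cp = input.codePointAt(pos);
                out.appendCodePoint(cp);
                pos += Character.charCount(cp);
                continue;
            }
            /* The driver interprets the return value as a number of codepoints.
               The bounds check is a guard added for this sketch. */
            for (int i = 0; i < consumed && pos < input.length(); i++) {
                pos += Character.charCount(input.codePointAt(pos));
            }
        }
        return out.toString();
    }
}
```

With the mapping from the test above, calling translate(symbol + "=A", map, false) yields "AA", whereas passing true yields the expected "A=A".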


