Posted to jira@arrow.apache.org by "Maarten Breddels (Jira)" <ji...@apache.org> on 2020/06/15 12:19:00 UTC
[jira] [Created] (ARROW-9133) [C++] Add utf8_upper and utf8_lower

Maarten Breddels created ARROW-9133:
---------------------------------------

             Summary: [C++] Add utf8_upper and utf8_lower
                 Key: ARROW-9133
                 URL: https://issues.apache.org/jira/browse/ARROW-9133
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: Maarten Breddels


This is the equivalent of https://issues.apache.org/jira/browse/ARROW-9100 for UTF-8 strings. It will be a good test of unilib versus utf8proc, in terms of both performance and API.

Also, since Unicode strings can grow and shrink under case mapping, this is a good moment to start thinking about a memory allocation strategy.

How much can a 'string' (or byte sequence) length actually grow? 

Item 5.18 mentions that a string can expand by a factor of 3, by which they seem to mean 3 codepoints. This can be checked with Python:
{code:python}
for i in range(0x100, 0x110000):
    codepoint = chr(i)
    try:
        bytes_before = codepoint.encode()
    except UnicodeEncodeError:
        # Lone surrogates (U+D800..U+DFFF) cannot be encoded as UTF-8.
        continue
    bytes_after = codepoint.upper().encode()
    if len(bytes_before) != len(bytes_after):
        # Columns: decimal, hex, original, uppercased, bytes before, bytes after.
        print(i, hex(i), codepoint, codepoint.upper(), len(bytes_before), len(bytes_after))
....
912 0x390 ΐ Ϊ́ 2 6
...{code}
This shows that a single two-byte codepoint can expand to three two-byte codepoints (2 bytes => 6 bytes). The character Ϊ́ has no single precomposed capital form, so its uppercase is composed of one base character and two combining characters. Other special cases are explained in [https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt].
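
To make the 2 => 6 byte expansion concrete, here is a small sketch that breaks the uppercase form of U+0390 into its codepoints. It relies only on Python's built-in `str.upper` and the standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

source = "\u0390"          # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
upper = source.upper()     # full case mapping, may yield several codepoints

# One entry per resulting codepoint: label, Unicode name, UTF-8 byte length.
parts = [
    (f"U+{ord(ch):04X}", unicodedata.name(ch), len(ch.encode("utf-8")))
    for ch in upper
]
for label, name, nbytes in parts:
    print(label, name, nbytes)

bytes_before = len(source.encode("utf-8"))
bytes_after = sum(nbytes for _, _, nbytes in parts)
print(bytes_before, "=>", bytes_after)
```

The result is one base letter followed by two combining marks, each taking two bytes in UTF-8.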

This factor of 3 is used in CPython [https://github.com/python/cpython/blob/25f38d7044a3a47465edd851c4e04f337b2c4b9b/Objects/unicodeobject.c#L10058], which is an easy way to avoid growing the buffer dynamically.
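
A minimal sketch of that over-allocation idea, applied to UTF-8 bytes rather than to CPython's internal codepoint buffers. The function name `utf8_upper_preallocated` is hypothetical, and the sketch assumes the 3x byte bound observed in the experiment above holds for every input; it is not how CPython or Arrow implement this.

```python
def utf8_upper_preallocated(data: bytes) -> bytes:
    """Uppercase UTF-8 bytes using one worst-case allocation (hypothetical helper)."""
    # Assumption: uppercasing never grows the UTF-8 byte length by more than 3x.
    out = bytearray(3 * len(data))
    pos = 0
    for ch in data.decode("utf-8"):
        encoded = ch.upper().encode("utf-8")
        end = pos + len(encoded)
        assert end <= len(out), "3x bound violated"
        # Same-length slice assignment: writes in place, buffer size is unchanged.
        out[pos:end] = encoded
        pos = end
    # Trim the unused tail once at the end instead of growing during the loop.
    return bytes(out[:pos])


print(utf8_upper_preallocated("stra\u00dfe \u0390".encode("utf-8")).decode("utf-8"))
```

The trade-off is memory: the temporary buffer can be up to three times the input size, in exchange for never reallocating while writing.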

However, growing 3x in size seems at odds with the API of both utf8proc:

[https://github.com/JuliaStrings/utf8proc/blob/08f9999a0698639f15d07b12c0065a4494f2d504/utf8proc.c#L375]

and unilib:

[https://github.com/ufal/unilib/blob/d8276e70b7c11c677897f71030de7258cbb1f99e/unilib/unicode.h#L79]

Both functions can only return a single 32-bit value (thus 1 codepoint, encoding 1 character). Both libraries seem to ignore the special cases of case mapping (neither uses or downloads SpecialCasing.txt).

This means that if Arrow wants to support the same features as Python for upper casing and lower casing (which means fully implementing the Unicode case mapping rules), neither library is sufficient.

There are more edge cases and irregularities, but I propose to start with a version of utf8_lower and utf8_upper that ignores the special cases.
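
As a rough illustration of what "ignoring the special cases" could mean, the sketch below maps each codepoint to exactly one codepoint and leaves a character unchanged when Python's full mapping would produce more than one. The function name `upper_ignoring_special_cases` is hypothetical, and this is only an approximation: the single-codepoint mappings that utf8proc or unilib actually return may differ for some characters.

```python
def upper_ignoring_special_cases(text: str) -> str:
    """Approximate a one-codepoint-in, one-codepoint-out uppercase (hypothetical)."""
    out = []
    for ch in text:
        mapped = ch.upper()
        # Keep the original character when the full mapping expands to
        # several codepoints (the SpecialCasing.txt style cases).
        out.append(mapped if len(mapped) == 1 else ch)
    return "".join(out)


for sample in ["arrow", "stra\u00dfe", "\u0390"]:
    print(sample, "->", upper_ignoring_special_cases(sample), "vs", sample.upper())
```

With this behaviour the number of codepoints never changes, although the UTF-8 byte length still can, since a codepoint may map to one with a different encoded width.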

 

PS:

Another interesting finding is that although upper casing can increase the byte length by a factor of 3, lower casing a UTF-8 string increases the byte length by a factor of at most 3/2.
{code:python}
for i in range(0x100, 0x110000):
    codepoint = chr(i)
    try:
        bytes_before = codepoint.encode()
    except UnicodeEncodeError:
        continue
    bytes_after = codepoint.lower().encode()
    if len(bytes_before) != len(bytes_after):
        print(i, hex(i), codepoint, codepoint.lower(), len(bytes_before), len(bytes_after))
304 0x130 İ i̇ 2 3
570 0x23a Ⱥ ⱥ 2 3
574 0x23e Ⱦ ⱦ 2 3
7838 0x1e9e ẞ ß 3 2
8486 0x2126 Ω ω 3 2
8490 0x212a K k 3 1
8491 0x212b Å å 3 2
11362 0x2c62 Ɫ ɫ 3 2
11364 0x2c64 Ɽ ɽ 3 2
11373 0x2c6d Ɑ ɑ 3 2
11374 0x2c6e Ɱ ɱ 3 2
11375 0x2c6f Ɐ ɐ 3 2
11376 0x2c70 Ɒ ɒ 3 2
11390 0x2c7e Ȿ ȿ 3 2
11391 0x2c7f Ɀ ɀ 3 2
42893 0xa78d Ɥ ɥ 3 2
42922 0xa7aa Ɦ ɦ 3 2
42923 0xa7ab Ɜ ɜ 3 2
42924 0xa7ac Ɡ ɡ 3 2
42925 0xa7ad Ɬ ɬ 3 2
42926 0xa7ae Ɪ ɪ 3 2
42928 0xa7b0 Ʞ ʞ 3 2
42929 0xa7b1 Ʇ ʇ 3 2
42930 0xa7b2 Ʝ ʝ 3 2
{code}
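
The two growth factors can also be computed directly instead of being read off the listings. The sketch below scans every encodable codepoint and reports the largest per-codepoint byte growth for each direction; `max_growth` is an illustrative helper, and the result depends on the Unicode version bundled with the Python interpreter in use.

```python
from fractions import Fraction


def max_growth(case_fn):
    """Return (largest byte-length ratio, first codepoint reaching it)."""
    best_ratio, best_cp = Fraction(0), None
    for i in range(0x110000):
        ch = chr(i)
        try:
            before = len(ch.encode("utf-8"))
        except UnicodeEncodeError:
            # Lone surrogates cannot be encoded as UTF-8.
            continue
        after = len(case_fn(ch).encode("utf-8"))
        ratio = Fraction(after, before)
        if ratio > best_ratio:
            best_ratio, best_cp = ratio, i
    return best_ratio, best_cp


upper_ratio, upper_cp = max_growth(str.upper)
lower_ratio, lower_cp = max_growth(str.lower)
print("upper:", upper_ratio, hex(upper_cp))
print("lower:", lower_ratio, hex(lower_cp))
```

A per-codepoint maximum is also an upper bound for whole strings here, assuming each codepoint is mapped independently of its neighbours.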


