Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2020/06/30 10:51:00 UTC
[jira] [Assigned] (ARROW-9133) [C++] Add utf8_upper and utf8_lower

     [ https://issues.apache.org/jira/browse/ARROW-9133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou reassigned ARROW-9133:
-------------------------------------

    Assignee: Maarten Breddels

> [C++] Add utf8_upper and utf8_lower
> -----------------------------------
>
>                 Key: ARROW-9133
>                 URL: https://issues.apache.org/jira/browse/ARROW-9133
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Maarten Breddels
>            Assignee: Maarten Breddels
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
>          Time Spent: 14.5h
>  Remaining Estimate: 0h
>
> This is the equivalent of https://issues.apache.org/jira/browse/ARROW-9100 for utf8. This will be a good test of unilib vs utf8proc, both performance-wise and API-wise.
> Also, since Unicode strings can grow and shrink under case mapping, this is a good starting point for thinking about a memory allocation strategy.
> How much can a 'string' (or byte sequence) actually grow in length?
> Item 5.18 mentioned that a string can expand by a factor of 3, by which they seem to mean 3 codepoints. This can be validated by checking with Python:
> {code:python}
> for i in range(0x100, 0x110000):
>     codepoint = chr(i)
>     try:
>         bytes_before = codepoint.encode()
>     except UnicodeEncodeError:
>         continue
>     bytes_after = codepoint.upper().encode()
>     if len(bytes_before) != len(bytes_after):
>         print(i, hex(i), codepoint, codepoint.upper(), len(bytes_before), len(bytes_after))
> ....
> 912 0x390 ΐ Ϊ́ 2 6
> ...{code}
> showing that a two-byte codepoint can expand to three two-byte codepoints (2 bytes => 6 bytes). The character ΐ has no single precomposed capital form, so its uppercase Ϊ́ is composed of a single base character and two combining characters. However, there are other situations, explained in [https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt].
> This increase by a factor of 3 is used in CPython [https://github.com/python/cpython/blob/25f38d7044a3a47465edd851c4e04f337b2c4b9b/Objects/unicodeobject.c#L10058], which is an easy way to avoid having to grow the buffer dynamically.
> However, growing 3x in size seems at odds with the API of both utf8proc:
> [https://github.com/JuliaStrings/utf8proc/blob/08f9999a0698639f15d07b12c0065a4494f2d504/utf8proc.c#L375]
> and unilib:
> [https://github.com/ufal/unilib/blob/d8276e70b7c11c677897f71030de7258cbb1f99e/unilib/unicode.h#L79]
> which can only return a single 32-bit value (thus 1 codepoint, encoding 1 character). Both libraries seem to ignore the special cases of case mapping (neither library uses or downloads SpecialCasing.txt).
> This means that if Arrow wants to support the same features as Python for upper casing and lower casing (which really means implementing Unicode case mapping in full), neither library is sufficient.
> There are more edge cases/irregularities, but I propose to start with a version of utf8_lower and utf8_upper that ignores the special cases.
>  
> PS:
> Another interesting finding is that although upper casing can increase a buffer's length by a factor of 3, lowercasing a utf8 string increases the byte length by a factor of at most 3/2.
> {code:python}
> for i in range(0x100, 0x110000):
>     codepoint = chr(i)
>     try:
>         bytes_before = codepoint.encode()
>     except UnicodeEncodeError:
>         continue
>     bytes_after = codepoint.lower().encode()
>     if len(bytes_before) != len(bytes_after):
>         print(i, hex(i), codepoint, codepoint.lower(), len(bytes_before), len(bytes_after))
> 304 0x130 İ i̇ 2 3
> 570 0x23a Ⱥ ⱥ 2 3
> 574 0x23e Ⱦ ⱦ 2 3
> 7838 0x1e9e ẞ ß 3 2
> 8486 0x2126 Ω ω 3 2
> 8490 0x212a K k 3 1
> 8491 0x212b Å å 3 2
> 11362 0x2c62 Ɫ ɫ 3 2
> 11364 0x2c64 Ɽ ɽ 3 2
> 11373 0x2c6d Ɑ ɑ 3 2
> 11374 0x2c6e Ɱ ɱ 3 2
> 11375 0x2c6f Ɐ ɐ 3 2
> 11376 0x2c70 Ɒ ɒ 3 2
> 11390 0x2c7e Ȿ ȿ 3 2
> 11391 0x2c7f Ɀ ɀ 3 2
> 42893 0xa78d Ɥ ɥ 3 2
> 42922 0xa7aa Ɦ ɦ 3 2
> 42923 0xa7ab Ɜ ɜ 3 2
> 42924 0xa7ac Ɡ ɡ 3 2
> 42925 0xa7ad Ɬ ɬ 3 2
> 42926 0xa7ae Ɪ ɪ 3 2
> 42928 0xa7b0 Ʞ ʞ 3 2
> 42929 0xa7b1 Ʇ ʇ 3 2
> 42930 0xa7b2 Ʝ ʝ 3 2
> {code}
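
The growth factors quoted in the description (3x for upper casing, 3/2 for lower casing) can be cross-checked with a small variation of the two loops. This is only a sketch: max_growth is a hypothetical helper name, it relies on Python's built-in str.upper and str.lower (which apply the full case mappings of whatever Unicode version the interpreter ships), and it scans from 0 rather than from 0x100.

```python
def max_growth(case_fn):
    """Return (ratio, codepoint) for the largest UTF-8 byte growth under case_fn.

    case_fn is a str -> str case mapping such as str.upper or str.lower.
    The first codepoint reaching the maximum ratio is the one reported.
    """
    worst_ratio, worst_cp = 1.0, None
    for i in range(0x110000):
        ch = chr(i)
        try:
            before = ch.encode("utf-8")
        except UnicodeEncodeError:
            continue  # lone surrogates cannot be encoded as UTF-8
        after = case_fn(ch).encode("utf-8")
        ratio = len(after) / len(before)
        if ratio > worst_ratio:
            worst_ratio, worst_cp = ratio, i
    return worst_ratio, worst_cp


if __name__ == "__main__":
    for name, fn in (("upper", str.upper), ("lower", str.lower)):
        ratio, cp = max_growth(fn)
        print(name, ratio, hex(cp) if cp is not None else None)
```

If the interpreter's Unicode tables agree with the ones used in the description, the reported codepoints should be the ones the listings single out: 0x390 for upper casing and 0x130 for lower casing.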

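The allocation strategy mentioned in the description (reserve the worst case up front, as CPython does, instead of growing the buffer dynamically) combined with the proposal to ignore the special cases might look roughly like the sketch below. The names upper_simple and utf8_upper_prealloc are hypothetical, and upper_simple is only an approximation of a one-codepoint-in, one-codepoint-out mapping: it falls back to the unchanged character whenever Python's full mapping yields more than one codepoint, which is not necessarily what utf8proc or unilib return for those characters.

```python
def upper_simple(ch: str) -> str:
    # Approximation (assumption): keep only mappings that yield exactly one
    # codepoint; characters with a multi-codepoint upper case stay unchanged.
    up = ch.upper()
    return up if len(up) == 1 else ch


def utf8_upper_prealloc(data: bytes, growth: int = 3) -> bytes:
    """Upper-case UTF-8 bytes into a buffer allocated once at worst-case size."""
    # growth=3 is the bound discussed in the description; it is generous
    # when only single-codepoint mappings are applied.
    out = bytearray(growth * len(data))
    pos = 0
    for ch in data.decode("utf-8"):
        encoded = upper_simple(ch).encode("utf-8")
        # Same-length slice assignment: writes in place, never resizes.
        out[pos:pos + len(encoded)] = encoded
        pos += len(encoded)
    # Trim the unused tail of the over-allocated buffer.
    return bytes(out[:pos])


if __name__ == "__main__":
    print(utf8_upper_prealloc("arrow".encode("utf-8")))
```

The trade-off is memory for simplicity: the output buffer is never reallocated during the pass, at the cost of temporarily holding up to three times the input size.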

