Posted to issues@arrow.apache.org by "rohanjain101 (via GitHub)" <gi...@apache.org> on 2023/03/17 01:36:08 UTC

[GitHub] [arrow] rohanjain101 opened a new issue, #34599: compute.utf8_upper produces different result than str.upper for "ẞ"

rohanjain101 opened a new issue, #34599:
URL: https://github.com/apache/arrow/issues/34599

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   ```
   In [1]: import pyarrow as pa
   
   In [2]: pa.compute.utf8_upper("ß")
   Out[2]: <pyarrow.StringScalar: 'ẞ'>
   
   In [3]: pa.__version__
   Out[3]: '11.0.0'
   
   # Python's str.upper, for comparison:
   
   >>> char = "ß"
   >>> char.upper()
   'SS'
   ```
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] "utf8_upper" kernel produces different result than Python's str.upper for "ẞ" [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou closed issue #34599: [C++] "utf8_upper" kernel produces different result than Python's str.upper for "ẞ"
URL: https://github.com/apache/arrow/issues/34599



[GitHub] [arrow] jorisvandenbossche commented on issue #34599: [C++] "utf8_upper" kernel produces different result than Python's str.upper for "ẞ"

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #34599:
URL: https://github.com/apache/arrow/issues/34599#issuecomment-1473321273

   Arrow uses the `utf8proc` C library for UTF-8 operations (https://juliastrings.github.io/utf8proc/).
   
   This library changed the uppercase mapping of "ß" from "SS" to "ẞ" a few years ago: https://github.com/JuliaStrings/utf8proc/issues/130
   
   There is some discussion about what the correct uppercase form should be; see, for example, https://bugs.openjdk.org/browse/JDK-8186073. The Unicode standard (http://unicode.org/charts/PDF/U1E00.pdf) mentions:
   
   > The capital letter sharp s is part of the official German
   > orthography since 2017. Along with "SS" it is an allowed
   > variant spelling of 00DF in "all caps" style
   
   https://www.fileformat.info/info/unicode/char/00df/index.htm mentions *"uppercase is "SS" (standard case mapping), alternatively [U+1E9E](https://www.fileformat.info/info/unicode/char/1e9e/index.htm)"*
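
The two mappings discussed above can be inspected with Python's standard library alone. This is only a sketch: it assumes CPython's built-in `str` methods, and the helper name `same_ignoring_case` is hypothetical (it is not part of Arrow or utf8proc). It shows that `str.upper` expands "ß" to "SS", that the capital sharp s lowercases back to "ß", and that case folding makes both uppercase spellings compare equal.

```python
import unicodedata

SHARP_S = "\u00df"          # ß, U+00DF
CAPITAL_SHARP_S = "\u1e9e"  # ẞ, U+1E9E

# Python's str.upper() expands ß to the two-letter sequence "SS".
python_upper = SHARP_S.upper()

# The capital sharp s lowercases back to ß.
round_trip = CAPITAL_SHARP_S.lower()


def same_ignoring_case(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings by their case-folded form."""
    return a.casefold() == b.casefold()


print(unicodedata.name(CAPITAL_SHARP_S))
print(python_upper, round_trip)
print(same_ignoring_case("STRASSE", "STRA" + CAPITAL_SHARP_S + "E"))
```

Comparing case-folded strings may be enough when the goal is only to match values that were uppercased by different libraries.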



[GitHub] [arrow] jorisvandenbossche commented on issue #34599: [C++] "utf8_upper" kernel produces different result than Python's str.upper for "ẞ"

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #34599:
URL: https://github.com/apache/arrow/issues/34599#issuecomment-1473330173

   So in the end, this is not something we can change in Arrow itself. If you want this to change, you will need to bring it up at https://github.com/JuliaStrings/utf8proc/ (but given that they changed this a few years ago, they are unlikely to change it again).
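
If matching Python's result matters more than following utf8proc's mapping, one possible workaround is to expand "ß" before the data reaches `utf8_upper`. The following is a minimal sketch under that assumption: `expand_sharp_s` is a hypothetical helper, not an Arrow API, and it handles only "ß", not any other character where `str.upper` and utf8proc might disagree.

```python
from typing import Iterable, List, Optional


def expand_sharp_s(values: Iterable[Optional[str]]) -> List[Optional[str]]:
    """Hypothetical pre-processing step: replace ß (U+00DF) with "SS".

    None entries (nulls) are passed through unchanged.
    """
    return [None if v is None else v.replace("\u00df", "SS") for v in values]


# The returned list could then be handed to pyarrow, for example
# pa.compute.utf8_upper(pa.array(prepared)), which is not run here.
prepared = expand_sharp_s(["straße", None, "gross"])
print(prepared)
```

For data that already lives in an Arrow array, `pyarrow.compute.replace_substring` with pattern "ß" and replacement "SS" should allow the same substitution without converting to Python objects first.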

