Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2019/07/16 23:56:07 UTC

[GitHub] [incubator-iceberg] electrum opened a new issue #293: Truncate transform on strings with Unicode characters

electrum opened a new issue #293: Truncate transform on strings with Unicode characters
URL: https://github.com/apache/incubator-iceberg/issues/293
 
 
   The specification for truncate says
   
   > *Substring of length `L`*
   
   but does not define what it is counting. I assume the intention is for it to be Unicode code points, since the specification says that
   
   > Character strings must be stored as UTF-8 encoded byte arrays
   
   However, the Java reference implementation uses `java.lang.CharSequence#subSequence`, so the length is measured in 16-bit UTF-16 code units. That count differs from the number of code points whenever the string contains characters outside the Basic Multilingual Plane (BMP), because each such code point is encoded as two `char` values, a high and a low surrogate. Additionally, the truncation may fall in the middle of a surrogate pair, leaving an unpaired surrogate, which is a form of corruption.
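
   The difference can be reproduced with a small sketch. This is not the Iceberg implementation: the class `TruncateDemo` and both method names are hypothetical, and the code-point variant is only one possible reading of "length `L`" in the spec. It uses only standard `java.lang.String` and `java.lang.Character` calls.

```java
public class TruncateDemo {
    // Truncates by UTF-16 code units, which is what CharSequence#subSequence counts.
    static String truncateCodeUnits(String s, int length) {
        return s.subSequence(0, Math.min(length, s.length())).toString();
    }

    // Hypothetical alternative: truncates by Unicode code points,
    // so a surrogate pair is never split.
    static String truncateCodePoints(String s, int length) {
        if (s.codePointCount(0, s.length()) <= length) {
            return s;
        }
        // Index of the char that starts the (length + 1)-th code point.
        int end = s.offsetByCodePoints(0, length);
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (outside the BMP, stored as a surrogate pair), 'b'
        String s = "a\uD83D\uDE00b";

        String byUnits = truncateCodeUnits(s, 2);
        String byPoints = truncateCodePoints(s, 2);

        // The code-unit result ends with an unpaired high surrogate.
        System.out.println(Character.isHighSurrogate(byUnits.charAt(byUnits.length() - 1)));
        // The code-point result keeps the whole pair: 2 code points, 3 chars.
        System.out.println(byPoints.codePointCount(0, byPoints.length()));
        System.out.println(byPoints.length());
    }
}
```

   With `L = 2`, the code-unit version returns `a` followed by a lone high surrogate, while the code-point version returns `a` followed by the complete U+1F600 character.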
