Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/07/19 20:44:26 UTC

[GitHub] [iceberg] rdblue edited a comment on issue #2837: Incorrect bucket value calculated for string with non-BMP characters

rdblue edited a comment on issue #2837:
URL: https://github.com/apache/iceberg/issues/2837#issuecomment-882846235


   @RussellSpitzer, yes. But I think the question is whether we expect anyone to hit this problem. I'm not familiar enough with Unicode to know whether regular use in other languages would trigger this bug. If it only affects code points like 💰, then I'm not sure that we need to add compatibility. But if it affects normal use in character-based languages, then we should build and document a fix like the one for negative date values.
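   
   One way to gauge exposure is to check whether real values contain any code point outside the Basic Multilingual Plane. The sketch below is only an illustration: the class and method names are hypothetical and not part of Iceberg. It relies on the fact that Java stores such code points as surrogate pairs. Everyday CJK text such as 日本語 stays inside the BMP, while emoji like 💰 (U+1F4B0) and rarer ideographs like 𠮷 (U+20BB7) do not.

```java
final class NonBmpCheck {
    // Hypothetical helper: true if the string has at least one code point
    // above U+FFFF, which Java represents as a surrogate pair of two chars.
    static boolean hasNonBmp(String s) {
        return s.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    public static void main(String[] args) {
        System.out.println(hasNonBmp("\u65E5\u672C\u8A9E")); // 日本語, BMP only
        System.out.println(hasNonBmp("\uD83D\uDCB0"));       // 💰, U+1F4B0
        System.out.println(hasNonBmp("\uD842\uDFB7"));       // 𠮷, U+20BB7
    }
}
```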
   
   If we end up doing that, it should be a matter of updating the projections from string predicates to bucket id predicates. For example, `eq("col", "💰")` should be projected to `eq("col_bucket", 12)`, but we need to create `or(eq("col_bucket", 4), eq("col_bucket", 12))` instead to pick up data incorrectly placed in bucket 4. This isn't too bad: we only need to update equality and `in` predicates, because bucket function projection doesn't work for inequalities.
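   
   A minimal sketch of that projection change, assuming the bug only affects strings with non-BMP code points (as the issue title suggests). Everything here is hypothetical: `candidateBuckets`, `correctBucket`, and `legacyBucket` are illustrative names rather than Iceberg APIs, and the two bucket functions are passed in as stand-ins for the fixed and the buggy hash.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.ToIntFunction;

final class BucketProjectionSketch {
    // Hypothetical helper: the bucket ids an equality predicate on `value`
    // has to match. The legacy id is added only for values that the buggy
    // hash could have placed in a different bucket (assumption: non-BMP only).
    static Set<Integer> candidateBuckets(
            String value,
            ToIntFunction<String> correctBucket,
            ToIntFunction<String> legacyBucket) {
        Set<Integer> ids = new LinkedHashSet<>();
        ids.add(correctBucket.applyAsInt(value));
        boolean hasNonBmp =
                value.codePoints().anyMatch(Character::isSupplementaryCodePoint);
        if (hasNonBmp) {
            ids.add(legacyBucket.applyAsInt(value));
        }
        return ids;
    }

    public static void main(String[] args) {
        // Stand-in bucket functions using the illustrative ids from the comment.
        ToIntFunction<String> correct = v -> 12;
        ToIntFunction<String> legacy = v -> 4;
        // 💰 (U+1F4B0): both ids are candidates, i.e. an OR over two buckets.
        System.out.println(candidateBuckets("\uD83D\uDCB0", correct, legacy));
        // BMP-only value: a single bucket id, so the projection is unchanged.
        System.out.println(candidateBuckets("abc", correct, legacy));
    }
}
```

   An `in` predicate would take the union of these sets over all of its values, and a single-element result collapses back to the plain `eq` projection.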


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
