You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/07/19 07:59:26 UTC

[GitHub] [iceberg] findepi opened a new issue #2837: Incorrect bucket value calculated for string with non-BMP characters

findepi opened a new issue #2837:
URL: https://github.com/apache/iceberg/issues/2837


   For example, when bucketing string into 32 buckets, the value `"\uD83D\uDCB0"` (`"💰"`) gets assigned a wrong bucket
   
   ### Expected bucket: 12
   
   
   Per specification, string hash is `hashBytes(utf8Bytes(v))`
   
   ```
   int hash = Hashing.murmur3_32().hashBytes("\uD83D\uDCB0".getBytes(StandardCharsets.UTF_8)).asInt() & Integer.MAX_VALUE;
   int bucket = hash % 32;
   System.out.println("hash = " + hash); // 661122892
   System.out.println("bucket = " + bucket); // 12
   ```
   
   ### Actual bucket: 4
   
   ```
   Integer bucket = new org.apache.iceberg.transforms.Bucket.BucketString(32).apply("\uD83D\uDCB0");
   System.out.println("bucket = " + bucket); // 4
   ```
   
   ### Assumed root cause
   
   `BucketString` code:
   https://github.com/apache/iceberg/blob/0c50b2074cd5dad59bbcb4b4599ec3ae11a34b49/api/src/main/java/org/apache/iceberg/transforms/Bucket.java#L239
   
   is affected by https://github.com/google/guava/issues/5648


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] findepi commented on issue #2837: Incorrect bucket value calculated for string with non-BMP characters

Posted by GitBox <gi...@apache.org>.
findepi commented on issue #2837:
URL: https://github.com/apache/iceberg/issues/2837#issuecomment-883140345






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on issue #2837: Incorrect bucket value calculated for string with non-BMP characters

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #2837:
URL: https://github.com/apache/iceberg/issues/2837#issuecomment-882826010






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue edited a comment on issue #2837: Incorrect bucket value calculated for string with non-BMP characters

Posted by GitBox <gi...@apache.org>.
rdblue edited a comment on issue #2837:
URL: https://github.com/apache/iceberg/issues/2837#issuecomment-882846235


   @RussellSpitzer, yes. But I think the question is whether we expect anyone to have this problem. I'm not familiar enough with unicode to know whether we would expect regular use in other languages to hit this bug. If this only affects code points like 💰 then I'm not sure that we need to add compatibility. But if this affects normal use in character-based languages then we should build and document a fix like the one for negative date values.
   
   If we end up doing that, it should be a matter of updating the projections from string predicates to bucket id predicates. For example, `eq("col", "💰")` should be projected to `eq("col_bucket", 12)` but we need to create `and(eq("col_bucket", 4), eq("col_bucket", 12))` instead to pick up data incorrectly placed in bucket 4. This isn't too bad because we only need to update equality and in predicates because bucket function projection doesn't work for inequalities.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on issue #2837: Incorrect bucket value calculated for string with non-BMP characters

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #2837:
URL: https://github.com/apache/iceberg/issues/2837#issuecomment-882826010


   Thanks for letting us know about this, @findepi. We'll have to discuss this at the next sync and on the dev list to see whether this is something we expect to need compatibility for, or if we can just fix it by updating how we call murmur3.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on issue #2837: Incorrect bucket value calculated for string with non-BMP characters

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #2837:
URL: https://github.com/apache/iceberg/issues/2837#issuecomment-882826010






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] findepi commented on issue #2837: Incorrect bucket value calculated for string with non-BMP characters

Posted by GitBox <gi...@apache.org>.
findepi commented on issue #2837:
URL: https://github.com/apache/iceberg/issues/2837#issuecomment-884875587


   > current plan is to fix the hash implementation
   
   i posted a PR https://github.com/apache/iceberg/pull/2849 with a proposed fix.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] findepi commented on issue #2837: Incorrect bucket value calculated for string with non-BMP characters

Posted by GitBox <gi...@apache.org>.
findepi commented on issue #2837:
URL: https://github.com/apache/iceberg/issues/2837#issuecomment-882407116






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #2837: Incorrect bucket value calculated for string with non-BMP characters

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #2837:
URL: https://github.com/apache/iceberg/issues/2837#issuecomment-882826731


   Do we have a problem if we do fix this? Since data using the old Murmur will be improperly partitioned?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue edited a comment on issue #2837: Incorrect bucket value calculated for string with non-BMP characters

Posted by GitBox <gi...@apache.org>.
rdblue edited a comment on issue #2837:
URL: https://github.com/apache/iceberg/issues/2837#issuecomment-882846235


   @RussellSpitzer, yes. But I think the question is whether we expect anyone to have this problem. I'm not familiar enough with unicode to know whether we would expect regular use in other languages to hit this bug. If this only affects code points like 💰 then I'm not sure that we need to add compatibility. But if this affects normal use in character-based languages then we should build and document a fix like the one for negative date values.
   
   If we end up doing that, it should be a matter of updating the projections from string predicates to bucket id predicates. For example, `eq("col", "💰")` should be projected to `eq("col_bucket", 12)` but we need to create `and(eq("col_bucket", 4), eq("col_bucket", 12))` instead to pick up data incorrectly placed in bucket 4. This isn't too bad because we only need to update equality and in predicates because bucket function projection doesn't work for inequalities.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue closed issue #2837: Incorrect bucket value calculated for string with non-BMP characters

Posted by GitBox <gi...@apache.org>.
rdblue closed issue #2837:
URL: https://github.com/apache/iceberg/issues/2837


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #2837: Incorrect bucket value calculated for string with non-BMP characters

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #2837:
URL: https://github.com/apache/iceberg/issues/2837#issuecomment-882826731


   Do we have a problem if we do fix this? Since data using the old Murmur will be improperly partitioned?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue edited a comment on issue #2837: Incorrect bucket value calculated for string with non-BMP characters

Posted by GitBox <gi...@apache.org>.
rdblue edited a comment on issue #2837:
URL: https://github.com/apache/iceberg/issues/2837#issuecomment-882846235


   @RussellSpitzer, yes. But I think the question is whether we expect anyone to have this problem. I'm not familiar enough with unicode to know whether we would expect regular use in other languages to hit this bug. If this only affects code points like 💰 then I'm not sure that we need to add compatibility. But if this affects normal use in character-based languages then we should build and document a fix like the one for negative date values.
   
   If we end up doing that, it should be a matter of updating the projections from string predicates to bucket id predicates. For example, `eq("col", "💰")` should be projected to `eq("col_bucket", 12)` but we need to create `and(eq("col_bucket", 4), eq("col_bucket", 12))` instead to pick up data incorrectly placed in bucket 4. This isn't too bad because we only need to update equality and in predicates because bucket function projection doesn't work for inequalities.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on issue #2837: Incorrect bucket value calculated for string with non-BMP characters

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #2837:
URL: https://github.com/apache/iceberg/issues/2837#issuecomment-884562950


   @findepi, thanks for the extra info about how wide-spread this is.
   
   > Could the string bucketing fix be considered part of v2 breaking changes?
   
   There's no problem with the spec here because the spec requires the hash value to be equivalent to hashing the UTF-8 bytes.
   
   The problem is with the Iceberg reference implementation, so we need to decide how to address that. We discussed this at our sync this morning and the rough consensus was to notify users that this is an issue, but not implement the change to filter projection. There were two main reasons: (1) there are other operations that use bucketing (joins) and bad partition values can't be fixed in those cases and (2) supporting the work-around once Guava is fixed will be difficult and require us to keep the buggy Guava code in Iceberg.
   
   Like you said bucketing by arbitrary string inputs with these characters should be fairly rare. Someone might bucket by a UUID string, but those are predictable and don't use non-BMP characters.
   
   Our current plan is to fix the hash implementation and include the issue in release notes. We're also working on a way to repartition data in an action, so that should help people that are caught by this. In the meantime, we can tell people how to use MERGE INTO to fix the bad data.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #2837: Incorrect bucket value calculated for string with non-BMP characters

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #2837:
URL: https://github.com/apache/iceberg/issues/2837#issuecomment-882826731


   Do we have a problem if we do fix this? Since data using the old Murmur will be improperly partitioned?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] findepi commented on issue #2837: Incorrect bucket value calculated for string with non-BMP characters

Posted by GitBox <gi...@apache.org>.
findepi commented on issue #2837:
URL: https://github.com/apache/iceberg/issues/2837#issuecomment-882407116


   Also mentioned on user list -- https://mail-archives.apache.org/mod_mbox/iceberg-dev/202107.mbox/%3CCAMrFzbtTTxhYmtE5W4SN1tXdRXFxTntBXLUp4jhPMDT5apRLKA%40mail.gmail.com%3E 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] findepi commented on issue #2837: Incorrect bucket value calculated for string with non-BMP characters

Posted by GitBox <gi...@apache.org>.
findepi commented on issue #2837:
URL: https://github.com/apache/iceberg/issues/2837#issuecomment-883264650


   @rdblue could the string bucketing fix be considered part of v2 breaking changes?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] findepi commented on issue #2837: Incorrect bucket value calculated for string with non-BMP characters

Posted by GitBox <gi...@apache.org>.
findepi commented on issue #2837:
URL: https://github.com/apache/iceberg/issues/2837#issuecomment-883140345


   > If this only affects code points like 💰 then I'm not sure that we need to add compatibility. But if this affects normal use in character-based languages then we should build and document a fix
   
   Emojis are probably the most common non-BMP symbols but quick search found https://stackoverflow.com/questions/5567249/what-are-the-most-common-non-bmp-unicode-characters-in-actual-use where someone mentioned symbols like "𨭎", "𠬠", and "𩷶 (these are probably not regular words?), mathematical symbols and other things.
   
   I would _assume_ values containing arbitrary inputs are never bucketed on though.
   
   
   > but we need to create and(eq("col_bucket", 4), eq("col_bucket", 12)) instead to pick up data incorrectly placed in bucket 4
   
   That's easy to do today -- e.g. just call two different Guava APIs for Murmur3_32 to get numbers 4 and 12.
   Once Guava is changed -- or for any application not using Guava -- it would require copying the affected/buggy hashing algorithm and making it part of the fix.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on issue #2837: Incorrect bucket value calculated for string with non-BMP characters

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #2837:
URL: https://github.com/apache/iceberg/issues/2837#issuecomment-882846235


   @RussellSpitzer, yes. But I think the question is whether we expect anyone to have this problem. I'm not familiar enough with unicode to know whether we would expect regular use in other languages to hit this bug. If this only affects code points like 💰 then I'm not sure that we need to add compatibility. But if this affects normal use in character-based languages then we should build and document a fix like the one for negative date values.
   
   If we end up doing that, it should be a matter of updating the projections from string predicates to bucket id predicates. For example, `eq("col", "💰")` would normally project as `eq("col_bucket", 4)` but we need to create `and(eq("col_bucket", 4), eq("col_bucket", 12))` instead. This isn't too bad because we only need to update equality and in predicates because bucket function projection doesn't work for inequalities.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org