You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/07/22 10:22:57 UTC

[GitHub] [iceberg] findepi opened a new pull request #2849: Fix bucketing of strings with non-BMP characters

findepi opened a new pull request #2849:
URL: https://github.com/apache/iceberg/pull/2849


   Fixes https://github.com/apache/iceberg/issues/2837


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2849: Fix bucketing of strings with non-BMP characters

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2849:
URL: https://github.com/apache/iceberg/pull/2849#issuecomment-885234469


   Thanks for fixing this, @findepi! Nice work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2849: Fix bucketing of strings with non-BMP characters

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2849:
URL: https://github.com/apache/iceberg/pull/2849#discussion_r674899462



##########
File path: api/src/test/java/org/apache/iceberg/transforms/TestBucketing.java
##########
@@ -215,6 +215,24 @@ public void testString() {
         hashBytes(asBytes), bucketFunc.hash(string));
   }
 
+  @Test
+  public void testStringWithSurrogatePair() {
+    String string = "string with a surrogate pair: 💰";
+    Assert.assertNotEquals("string has no surrogate pairs", string.length(), string.codePoints().count());
+    byte[] asBytes = string.getBytes(StandardCharsets.UTF_8);
+
+    Bucket<CharSequence> bucketFunc = Bucket.get(Types.StringType.get(), 100);
+
+    Assert.assertEquals("String hash should match hash of UTF-8 bytes",
+        hashBytes(asBytes), bucketFunc.hash(string));
+
+    Assert.assertNotEquals("It looks like Guava has been updated and now contains a fix for " +
+                    "https://github.com/google/guava/issues/5648. Please resolve the TODO in BucketString.hash " +
+                    "and remove this assertion",
+            hashBytes(asBytes),
+            MURMUR3.hashString(string, StandardCharsets.UTF_8).asInt());

Review comment:
       Nit: Looks like indentation is off for this statement.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] findepi commented on a change in pull request #2849: Fix bucketing of strings with non-BMP characters

Posted by GitBox <gi...@apache.org>.
findepi commented on a change in pull request #2849:
URL: https://github.com/apache/iceberg/pull/2849#discussion_r675023438



##########
File path: api/src/test/java/org/apache/iceberg/transforms/TestBucketing.java
##########
@@ -215,6 +215,24 @@ public void testString() {
         hashBytes(asBytes), bucketFunc.hash(string));
   }
 
+  @Test
+  public void testStringWithSurrogatePair() {
+    String string = "string with a surrogate pair: 💰";
+    Assert.assertNotEquals("string has no surrogate pairs", string.length(), string.codePoints().count());
+    byte[] asBytes = string.getBytes(StandardCharsets.UTF_8);
+
+    Bucket<CharSequence> bucketFunc = Bucket.get(Types.StringType.get(), 100);
+
+    Assert.assertEquals("String hash should match hash of UTF-8 bytes",
+        hashBytes(asBytes), bucketFunc.hash(string));
+
+    Assert.assertNotEquals("It looks like Guava has been updated and now contains a fix for " +
+                    "https://github.com/google/guava/issues/5648. Please resolve the TODO in BucketString.hash " +
+                    "and remove this assertion",
+            hashBytes(asBytes),
+            MURMUR3.hashString(string, StandardCharsets.UTF_8).asInt());

Review comment:
       eventually my IDE is not against me on this :) 
   
   fixed




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2849: Fix bucketing of strings with non-BMP characters

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2849:
URL: https://github.com/apache/iceberg/pull/2849#issuecomment-885001482


   Looks good to me. I agree with the choice to detect cases where `hashString` can't be used since converting the string to bytes will allocate a new `ByteBuffer` and copy into it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] findepi commented on pull request #2849: Fix bucketing of strings with non-BMP characters

Posted by GitBox <gi...@apache.org>.
findepi commented on pull request #2849:
URL: https://github.com/apache/iceberg/pull/2849#issuecomment-884876098


   cc @alexjo2144 @losipiuk @joshthoward @electrum 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2849: Fix bucketing of strings with non-BMP characters

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2849:
URL: https://github.com/apache/iceberg/pull/2849#issuecomment-885001980


   @findepi, looks like I don't have permission to update the indentation in the test. Could you fix that and I'll merge this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #2849: Fix bucketing of strings with non-BMP characters

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #2849:
URL: https://github.com/apache/iceberg/pull/2849


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] findepi commented on pull request #2849: Fix bucketing of strings with non-BMP characters

Posted by GitBox <gi...@apache.org>.
findepi commented on pull request #2849:
URL: https://github.com/apache/iceberg/pull/2849#issuecomment-885210486


   Thanks @rdblue 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org