You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/07/18 00:51:07 UTC

[GitHub] [iceberg] jun-he opened a new pull request #2836: [Python] support BucketByteBuffer and BucketUUID

jun-he opened a new pull request #2836:
URL: https://github.com/apache/iceberg/pull/2836


   Follow up https://github.com/apache/iceberg/pull/2689#discussion_r650597195


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on pull request #2836: [Python] support BucketByteBuffer and BucketUUID

Posted by GitBox <gi...@apache.org>.
rymurr commented on pull request #2836:
URL: https://github.com/apache/iceberg/pull/2836#issuecomment-884771775


   > Elevating the 'wrong' behavior to be 'the standard' will be hard for non-Java languages.
   
   I agree. I think the last thing we want to do is start duplicating guava bugs in other languages. It is hard and unnecessary. 
   
   Shall we wait for the guava fix to be merged and propagated to Iceberg then?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on pull request #2836: [Python] support BucketByteBuffer and BucketUUID

Posted by GitBox <gi...@apache.org>.
jun-he commented on pull request #2836:
URL: https://github.com/apache/iceberg/pull/2836#issuecomment-883103045


   @rymurr @TGooch44 they might mismatch. I will take a look at all the hashcode generated in Python versus Java. This might be a problem for many.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] TGooch44 commented on pull request #2836: [Python] support BucketByteBuffer and BucketUUID

Posted by GitBox <gi...@apache.org>.
TGooch44 commented on pull request #2836:
URL: https://github.com/apache/iceberg/pull/2836#issuecomment-882778882


   > Thanks @jun-he ! This looks good. However I am worried about #2837 and if python and Java might return different results. What do you think?
   
   It looks like the following matches the expected output(which I guess is different than the Java output):
   ```
   >>> import mmh3
   >>> (mmh3.hash("💰".encode("utf-8")) & 2147483647) % 32
   12
   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on pull request #2836: [Python] support BucketByteBuffer and BucketUUID

Posted by GitBox <gi...@apache.org>.
rymurr commented on pull request #2836:
URL: https://github.com/apache/iceberg/pull/2836#issuecomment-882583342






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on pull request #2836: [Python] support BucketByteBuffer and BucketUUID

Posted by GitBox <gi...@apache.org>.
rymurr commented on pull request #2836:
URL: https://github.com/apache/iceberg/pull/2836#issuecomment-882583342


   Thanks @jun-he ! This looks good. However I am worried about #2837 and if python and Java might return different results. What do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] TGooch44 commented on pull request #2836: [Python] support BucketByteBuffer and BucketUUID

Posted by GitBox <gi...@apache.org>.
TGooch44 commented on pull request #2836:
URL: https://github.com/apache/iceberg/pull/2836#issuecomment-883438929


   @ryanmurr Unfortunately, I think you're right that matching Java is more
   important than being correct according to spec.  We may want to wait on
   this until the java community decides if they are going to fix this or not.
   
   Seems like they're trying to decide how likely existing users are to have
   been impacted.
   
   On Tue, Jul 20, 2021, 2:06 AM Ryan Murray ***@***.***> wrote:
   
   > I guess python is doing the 'right' thing. The question for me is: will
   > Java start doing the right thing or maintain the wrong thing for backwards
   > compatibility. Python should, I think, be consistent w/ Java.
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/iceberg/pull/2836#issuecomment-883229491>, or
   > unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AAMETHZDOQENBDO5ZRB3JCLTYU4ANANCNFSM5ARSWF7A>
   > .
   >
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on pull request #2836: [Python] support BucketByteBuffer and BucketUUID

Posted by GitBox <gi...@apache.org>.
jun-he commented on pull request #2836:
URL: https://github.com/apache/iceberg/pull/2836#issuecomment-885346091


   Thanks @findepi for the fix.
   
   @TGooch44 @rymurr  I think python should just hash the UTF-8 bytes. I will run a quick check to see if Java and Python match in various cases.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] TGooch44 merged pull request #2836: [Python] support BucketByteBuffer and BucketUUID

Posted by GitBox <gi...@apache.org>.
TGooch44 merged pull request #2836:
URL: https://github.com/apache/iceberg/pull/2836


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] TGooch44 commented on pull request #2836: [Python] support BucketByteBuffer and BucketUUID

Posted by GitBox <gi...@apache.org>.
TGooch44 commented on pull request #2836:
URL: https://github.com/apache/iceberg/pull/2836#issuecomment-882778882


   > Thanks @jun-he ! This looks good. However I am worried about #2837 and if python and Java might return different results. What do you think?
   
   It looks like the following matches the expected output(which I guess is different than the Java output):
   ```
   >>> import mmh3
   >>> (mmh3.hash("💰".encode("utf-8")) & 2147483647) % 32
   12
   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] findepi commented on pull request #2836: [Python] support BucketByteBuffer and BucketUUID

Posted by GitBox <gi...@apache.org>.
findepi commented on pull request #2836:
URL: https://github.com/apache/iceberg/pull/2836#issuecomment-884982312


   > Shall we wait for the guava fix to be merged and propagated to Iceberg then?
   
   Per @rdblue 's https://github.com/apache/iceberg/issues/2837#issuecomment-884562950 i posted a proposed fix https://github.com/apache/iceberg/pull/2849.
   If it gets merged, the work here would be unblocked.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on pull request #2836: [Python] support BucketByteBuffer and BucketUUID

Posted by GitBox <gi...@apache.org>.
jun-he commented on pull request #2836:
URL: https://github.com/apache/iceberg/pull/2836#issuecomment-883103045


   @rymurr @TGooch44 they might mismatch. I will take a look at all the hashcode generated in Python versus Java. This might be a problem for many.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on pull request #2836: [Python] support BucketByteBuffer and BucketUUID

Posted by GitBox <gi...@apache.org>.
rymurr commented on pull request #2836:
URL: https://github.com/apache/iceberg/pull/2836#issuecomment-883229491


   I guess python is doing the 'right' thing. The question for me is: will Java start doing the right thing or maintain the wrong thing for backwards compatibility. Python should, I think, be consistent w/ Java.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on pull request #2836: [Python] support BucketByteBuffer and BucketUUID

Posted by GitBox <gi...@apache.org>.
rymurr commented on pull request #2836:
URL: https://github.com/apache/iceberg/pull/2836#issuecomment-882583342






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] TGooch44 commented on pull request #2836: [Python] support BucketByteBuffer and BucketUUID

Posted by GitBox <gi...@apache.org>.
TGooch44 commented on pull request #2836:
URL: https://github.com/apache/iceberg/pull/2836#issuecomment-882778882


   > Thanks @jun-he ! This looks good. However I am worried about #2837 and if python and Java might return different results. What do you think?
   
   It looks like the following matches the expected output(which I guess is different than the Java output):
   ```
   >>> import mmh3
   >>> (mmh3.hash("💰".encode("utf-8")) & 2147483647) % 32
   12
   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on pull request #2836: [Python] support BucketByteBuffer and BucketUUID

Posted by GitBox <gi...@apache.org>.
jun-he commented on pull request #2836:
URL: https://github.com/apache/iceberg/pull/2836#issuecomment-886330339


   @findepi  I added additional tests to check the hash values.
   It matched the result in [Java test](https://github.com/apache/iceberg/blob/master/api/src/test/java/org/apache/iceberg/transforms/TestBucketing.java#L64-L119)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on pull request #2836: [Python] support BucketByteBuffer and BucketUUID

Posted by GitBox <gi...@apache.org>.
jun-he commented on pull request #2836:
URL: https://github.com/apache/iceberg/pull/2836#issuecomment-883103045


   @rymurr @TGooch44 they might mismatch. I will take a look at all the hashcode generated in Python versus Java. This might be a problem for many.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] TGooch44 edited a comment on pull request #2836: [Python] support BucketByteBuffer and BucketUUID

Posted by GitBox <gi...@apache.org>.
TGooch44 edited a comment on pull request #2836:
URL: https://github.com/apache/iceberg/pull/2836#issuecomment-883438929


   @rymurr Unfortunately, I think you're right that matching Java is more
   important than being correct according to spec.  We may want to wait on
   this until the java community decides if they are going to fix this or not.
   
   Seems like they're trying to decide how likely existing users are to have
   been impacted.
   
   On Tue, Jul 20, 2021, 2:06 AM Ryan Murray ***@***.***> wrote:
   
   > I guess python is doing the 'right' thing. The question for me is: will
   > Java start doing the right thing or maintain the wrong thing for backwards
   > compatibility. Python should, I think, be consistent w/ Java.
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/iceberg/pull/2836#issuecomment-883229491>, or
   > unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AAMETHZDOQENBDO5ZRB3JCLTYU4ANANCNFSM5ARSWF7A>
   > .
   >
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on pull request #2836: [Python] support BucketByteBuffer and BucketUUID

Posted by GitBox <gi...@apache.org>.
jun-he commented on pull request #2836:
URL: https://github.com/apache/iceberg/pull/2836#issuecomment-881979960


   @TGooch44 and @rymurr can you help to review it? Thanks.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] TGooch44 commented on pull request #2836: [Python] support BucketByteBuffer and BucketUUID

Posted by GitBox <gi...@apache.org>.
TGooch44 commented on pull request #2836:
URL: https://github.com/apache/iceberg/pull/2836#issuecomment-884839758


   Waiting seems like the best option here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] findepi commented on pull request #2836: [Python] support BucketByteBuffer and BucketUUID

Posted by GitBox <gi...@apache.org>.
findepi commented on pull request #2836:
URL: https://github.com/apache/iceberg/pull/2836#issuecomment-884131235


   Elevating the 'wrong' behavior to be 'the standard' will be hard for non-Java languages.
   Please take a look at my proposed fix in Guava https://github.com/google/guava/pull/5649 for this.
   In order to match Iceberg Java implementation, one would perhaps need to translate that code (without the fix) into Python.
   
   Waiting for Java bucketing version to be fixed seems reasonable though. This avoids correctness issues -- no support for bucketing in Python, means no bugs at all.
   
   FWIW, in Trino, bucketing did not have the bug (that's how we found https://github.com/apache/iceberg/issues/2837), so 'correctly' bucketed data can be out there too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org