Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/17 06:57:08 UTC

[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #1841: Implement bitmap_distinct function using croaring-rs bitmap

Ted-Jiang commented on pull request #1841:
URL: https://github.com/apache/arrow-datafusion/pull/1841#issuecomment-1042631149


   
   > * I wonder if/how this gets things closer to being able to do distinct on compressed data (in DF's case on dictionary encoded columns). The problem (as I understand it) is that there is no guarantee that Arrow dictionaries have the same encoded representation for a value across batches, or even in the same record batch (if I remember how dictionary concatenation currently works in Arrow).
   
   `There is no guarantee that Arrow dictionaries have the same encoded representation for a value across batches`: yes, that is correct.
   We plan to maintain a global dictionary that encodes string columns as 32-bit integers, to accelerate count distinct.
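
A minimal sketch of the global-dictionary idea, using only the standard library. `GlobalDictionary` and `count_distinct` are hypothetical names for illustration, not part of this PR or of DataFusion, and a `BTreeSet<u32>` stands in for the Roaring bitmap so the example has no external dependencies:

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical global dictionary: every distinct string gets one stable
/// u32 code, no matter which batch it first appears in.
#[derive(Default)]
struct GlobalDictionary {
    codes: HashMap<String, u32>,
}

impl GlobalDictionary {
    /// Returns the existing code for `value`, or assigns the next free one.
    fn encode(&mut self, value: &str) -> u32 {
        if let Some(&code) = self.codes.get(value) {
            return code;
        }
        // Codes are assigned densely in first-seen order: 0, 1, 2, ...
        assert!(
            self.codes.len() < u32::MAX as usize,
            "too many distinct values for 32-bit codes"
        );
        let code = self.codes.len() as u32;
        self.codes.insert(value.to_owned(), code);
        code
    }
}

/// Counts distinct strings across batches by collecting their global codes.
/// The BTreeSet<u32> is a stand-in for a 32-bit bitmap.
fn count_distinct(dict: &mut GlobalDictionary, batches: &[Vec<&str>]) -> usize {
    let mut seen = BTreeSet::new();
    for batch in batches {
        for value in batch {
            seen.insert(dict.encode(value));
        }
    }
    seen.len()
}
```

Because the codes come from one shared dictionary rather than from per-batch Arrow dictionaries, the same string maps to the same integer in every batch, which is what a 32-bit bitmap needs.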
   
   > * Would this work on 64-bit columns if they could first be casted to 32-bit? That is, assuming the contents of the 64-bit column actually fit as 32-bit unsigned integers?

   IMO, the cast would lose the upper 32 bits, so the result would be incorrect.
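
To illustrate the concern, here is a small sketch (function names are hypothetical, and a `BTreeSet<u32>` again stands in for the bitmap). A truncating cast merges values that differ only in their upper 32 bits. A checked cast would only be safe under the assumption that every value really fits in 32 bits:

```rust
use std::collections::BTreeSet;

/// Distinct count after a truncating `as u32` cast, which keeps only the
/// low 32 bits of each value.
fn distinct_after_truncating_cast(values: &[u64]) -> usize {
    values
        .iter()
        .map(|&v| v as u32)
        .collect::<BTreeSet<u32>>()
        .len()
}

/// Distinct count with a checked cast: returns None as soon as a value
/// does not fit in 32 bits, instead of silently merging values.
fn distinct_with_checked_cast(values: &[u64]) -> Option<usize> {
    let mut seen = BTreeSet::new();
    for &v in values {
        if v > u32::MAX as u64 {
            return None;
        }
        seen.insert(v as u32);
    }
    Some(seen.len())
}
```

For example, `1` and `1 + (1 << 32)` are two distinct 64-bit values, but both truncate to `1`, so the truncating version counts them as one.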


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.