Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/02 16:33:19 UTC

[GitHub] [arrow-datafusion] alamb commented on pull request #1860: Increase default partition column type from Dict(UInt8) to Dict(UInt16)

alamb commented on pull request #1860:
URL: https://github.com/apache/arrow-datafusion/pull/1860#issuecomment-1057125192


   > The idea behind using UInt8 is that the values of a given partition column within a file will be all identical. If I have to materialize a large array with only zeros, I would rather not encode each 0 on 64 bits 😄. 
   
   I think this PR proposes to use 16 bits rather than 64, which allows more than 256 distinct partition values. One example use case might be partitioning by postal code, since there are more than 256 distinct postal codes in the United States.
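   
   To make the trade-off concrete, here is a rough sketch of the arithmetic (the helper names `key_capacity` and `keys_buffer_bytes` are made up for illustration and are not part of arrow or DataFusion). It only counts the keys buffer and ignores the dictionary values, validity bitmaps, and padding:

```rust
/// Number of distinct dictionary entries an unsigned key of `key_bits` can address.
/// (Hypothetical helper, for illustration only.)
fn key_capacity(key_bits: u32) -> u128 {
    // u128 so that a 64-bit key width does not overflow the shift.
    1u128 << key_bits
}

/// Approximate size in bytes of the keys buffer alone for `rows` rows.
/// (Hypothetical helper; ignores values, validity bitmap, and alignment padding.)
fn keys_buffer_bytes(rows: u64, key_bits: u32) -> u64 {
    rows * u64::from(key_bits / 8)
}
```

   Under these assumptions, 8-bit keys address 256 values and 16-bit keys address 65,536, while the keys buffer grows from 1 to 2 bytes per row (versus 8 bytes per row for 64-bit keys).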
   
   > To actually have a record batch with multiple partition values, you would need to go through something like the concat kernel first. Wouldn't it make sense to rely on that kernel to re-cast the index type appropriately? I think that it would be a safer approach in general to avoid overflowing when merging dictionaries.
   
   Having some way to dynamically pick the size of the dictionary keys certainly seems like a nice feature -- I am not sure how large of a change it would be though.
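   
   As a very rough sketch of what "dynamically picking" could mean (this is not existing DataFusion or arrow code; `KeyWidth` and `smallest_key_width` are hypothetical names), one could choose the narrowest unsigned key type that can index the number of distinct values seen so far:

```rust
/// Unsigned key widths, loosely mirroring Arrow's UInt8/UInt16/UInt32/UInt64.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum KeyWidth {
    U8,
    U16,
    U32,
    U64,
}

/// Pick the narrowest key width able to index `distinct_values` dictionary entries.
fn smallest_key_width(distinct_values: u64) -> KeyWidth {
    // Keys run from 0 to distinct_values - 1, so only the largest key must fit.
    let max_key = distinct_values.saturating_sub(1);
    if max_key <= u64::from(u8::MAX) {
        KeyWidth::U8
    } else if max_key <= u64::from(u16::MAX) {
        KeyWidth::U16
    } else if max_key <= u64::from(u32::MAX) {
        KeyWidth::U32
    } else {
        KeyWidth::U64
    }
}
```

   The hard part is probably not this selection but everything around it: the schema would no longer have a fixed key type, so the choice would have to be made (or the keys re-cast) wherever dictionaries are merged.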

