Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/03 14:56:05 UTC

[GitHub] [arrow-datafusion] rdettai edited a comment on pull request #1860: Increase default partition column type from Dict(UInt8) to Dict(UInt16)

rdettai edited a comment on pull request #1860:
URL: https://github.com/apache/arrow-datafusion/pull/1860#issuecomment-1058110737


   > I think this PR proposes to use 16 bits rather than 64 to allow more than 256 distinct partition values. One example use case might be when there are more than 256 distinct postal codes in the United States.
   
   I am not challenging that you can have partition keys with billions of different values 🙂. But I don't think this is the best place to bump the dictionary index size: at the file level, a partition column cannot hold more than one distinct value within a single record batch. It would be nicer to upcast this type downstream, once the record batches are manipulated in a way that breaks this uniqueness (for example after a `concat` op). Also, it would be even nicer if we had https://github.com/apache/arrow-datafusion/issues/1248 instead 😄
   
   If we find that doing it downstream is too complex, I am not firmly opposed to upcasting the type here, but then I agree with @yjshen that u16 isn't really enough. Also, making it customizable introduces tuning complexity, which isn't ideal either.
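
To make the argument concrete, here is a minimal sketch in plain Rust. It is a toy model, not the Arrow or DataFusion API: `DictU8`, `DictU16`, `partition_column` and `concat_upcast` are hypothetical names invented for this illustration. It assumes that each file contributes a single partition value, so a one-entry dictionary with `u8` keys is enough per batch, and that widening the keys only becomes necessary when batches are concatenated.

```rust
use std::convert::TryFrom;

// Toy model of a dictionary-encoded string column (NOT the Arrow API).
// `keys[i]` indexes into `values`; u8 keys can address at most 256 entries.
#[derive(Debug, Clone, PartialEq)]
struct DictU8 {
    values: Vec<String>,
    keys: Vec<u8>,
}

// Same layout with wider keys, used only after batches are merged.
#[derive(Debug, Clone, PartialEq)]
struct DictU16 {
    values: Vec<String>,
    keys: Vec<u16>,
}

/// Partition column for the rows of one file: every row carries the same
/// value, so the dictionary has a single entry and every key is 0.
fn partition_column(value: &str, num_rows: usize) -> DictU8 {
    DictU8 {
        values: vec![value.to_string()],
        keys: vec![0; num_rows],
    }
}

/// Hypothetical `concat` that merges the dictionaries and upcasts the keys
/// to u16. Returns `None` if even u16 cannot index the merged dictionary.
fn concat_upcast(batches: &[DictU8]) -> Option<DictU16> {
    let mut values: Vec<String> = Vec::new();
    let mut keys: Vec<u16> = Vec::new();
    for batch in batches {
        for &k in &batch.keys {
            let v = &batch.values[k as usize];
            // Linear lookup keeps the sketch short; a real implementation
            // would use a hash map.
            let idx = match values.iter().position(|x| x == v) {
                Some(i) => i,
                None => {
                    values.push(v.clone());
                    values.len() - 1
                }
            };
            keys.push(u16::try_from(idx).ok()?);
        }
    }
    Some(DictU16 { values, keys })
}
```

In this model, 300 single-value batches each fit comfortably in `u8` keys, while their concatenation needs a wider key type. The `None` branch mirrors the concern that `u16` itself runs out at 65,536 distinct values.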

