Posted to github@arrow.apache.org by "wjones127 (via GitHub)" <gi...@apache.org> on 2023/04/28 03:06:34 UTC

[GitHub] [arrow-datafusion] wjones127 commented on pull request #5545: Support arbitrary user defined partition column in `ListingTable` (rather than assuming they are always Dictionary encoded)

wjones127 commented on PR #5545:
URL: https://github.com/apache/arrow-datafusion/pull/5545#issuecomment-1526917997

   Got here as a downstream user who is affected by this change. Thinking through this, I wouldn't totally write off dictionary encoding integers as useless, since there are still benefits to dictionary arrays besides space savings. They essentially mark columns as having low cardinality and provide the set of unique values. Any scalar compute function run on these columns can be applied to the dictionary values while leaving the indices buffer untouched. That is an easy way to achieve what I would expect out of a "smart" compute engine: when projecting partition columns, project the distinct values rather than the expanded/materialized array. It's possible DataFusion already handles this in a smart way I'm unaware of, though.
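
To make the dictionary point concrete, here is a minimal sketch in plain Python. It is a toy model, not Arrow or DataFusion code: `DictColumn`, `map_values`, and `materialize` are hypothetical names invented for illustration. It shows a scalar function being applied once per distinct value while the per-row indices are reused as-is.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DictColumn:
    """Toy stand-in for a dictionary-encoded column (hypothetical, not an Arrow API)."""

    values: List[int]   # the dictionary: distinct values only
    indices: List[int]  # one entry per row, pointing into `values`

    def map_values(self, fn: Callable[[int], int]) -> "DictColumn":
        # Apply the scalar function once per distinct value and
        # reuse the same indices list without copying or touching it.
        return DictColumn([fn(v) for v in self.values], self.indices)

    def materialize(self) -> List[int]:
        # Expand to one value per row (what a naive engine would compute on).
        return [self.values[i] for i in self.indices]


# A partition column `year` with five rows but only two distinct values.
year = DictColumn(values=[2022, 2023], indices=[0, 0, 0, 1, 1])

# The function runs twice (once per distinct value), not five times.
next_year = year.map_values(lambda y: y + 1)
```

In this sketch the cost of the function scales with the number of distinct partition values rather than the number of rows, which is the benefit described above.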
   
   I'd also note that the ideal partition column types are probably [run-end encoded arrays](https://github.com/apache/arrow-rs/issues/3520) (`RunArray`), once they are implemented.
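
For readers unfamiliar with run-end encoding, the following is a rough sketch in plain Python of why it suits partition columns. It is a toy model, not the arrow-rs `RunArray` API: `RunEndColumn` and its methods are hypothetical names. Each run stores one value plus the exclusive row index at which the run ends, so a long stretch of rows sharing a partition value costs one entry.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List


@dataclass
class RunEndColumn:
    """Toy stand-in for a run-end encoded column (hypothetical, not an Arrow API)."""

    run_ends: List[int]  # cumulative, exclusive end row of each run
    values: List[int]    # one value per run

    def __len__(self) -> int:
        # The logical length is the end of the last run.
        return self.run_ends[-1] if self.run_ends else 0

    def get(self, row: int) -> int:
        # The run containing `row` is the first one whose end is > row.
        return self.values[bisect_right(self.run_ends, row)]


# Rows 0..2 belong to partition 2022, rows 3..4 to partition 2023.
year_ree = RunEndColumn(run_ends=[3, 5], values=[2022, 2023])
```

Unlike the dictionary form, there is no per-row indices buffer at all here; storage grows with the number of runs.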

