You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/28 22:46:38 UTC

[GitHub] [arrow-datafusion] Dandandan commented on pull request #1500: Add support for PartitionBy functionality

Dandandan commented on pull request #1500:
URL: https://github.com/apache/arrow-datafusion/pull/1500#issuecomment-1002310491


   Thanks for showing interest in this functionality!
   
   I believe this functionality shouldn't be a part of the `RepartitionExec`. The partitions (confusingly) mentioned there are a unit of parallelism and distributing data, not a way to aggregate data like in the example for writing to different partitions. The partitions in `PartitionExec` are equivalent to Spark partitions and mirror that functionality / design.
   
   We already have some pieces that we could use to implement the write support:
   * hash aggregation, i.e. `group by`. This could be used to group the same rows and append them to the output directories / files. `HashAggregateExec` has different modes, local and global. For writing, only local mode is needed, as we can write the rows to multiple files within the same directory (just like Spark does). 
   * `RepartitionExec::Hash`, allowing the parallelize the operations on more CPUs and nodes in Ballista. This is already utilized by using the hash aggregate code.
   
   The missing piece here is the support/wiring to *write* the final partitions to directories / files using a partitioning scheme.
   
   I don't have to much time ATM to write up a proposal myself, but if you see which direction I am going to, I support to write a design for this to collect feedback on it.
   
   It would be wise to look after some similar engines, like Spark/Hive/Trino/etc. for some inspiration.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org