You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/24 08:55:25 UTC

[GitHub] [arrow-datafusion] Dandandan opened a new issue #405: Hash RepartitionExec should buffer/emit batches based on target batch size

Dandandan opened a new issue #405:
URL: https://github.com/apache/arrow-datafusion/issues/405


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   Currently RepartitionExec based on hashing will split the rows of the batch into multiple partitions. This leads to smaller batch sizes (i.e. roughly 20x for 20 partitions), which are concatenated later by a `CoalesceBatchesExec`.
   It is probably beneficial to batch a number of rows (based on target batch size) and only emit the batches when they exceed the configured target batch size. This avoids doing a number of `take` calls on the smaller batches, plus avoiding the overhead of having to concatenate the smaller batches later. If this is implemented, the optimization rule to introduce CoalesceBatches probably should be removed after a  `RepartitionExec`, as there is no benefit / use of doing this anymore.
   
   **Describe the solution you'd like**
   Change the logic, do some benchmarking to show the benefit.
   
   **Describe alternatives you've considered**
   n/a
   **Additional context**
   n/a


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org