Posted to github@arrow.apache.org by "Dandandan (via GitHub)" <gi...@apache.org> on 2023/04/18 11:57:27 UTC

[GitHub] [arrow-datafusion] Dandandan opened a new issue, #6043: Improve RoundRobin `RepartitionExec`

Dandandan opened a new issue, #6043:
URL: https://github.com/apache/arrow-datafusion/issues/6043

   ### Describe the bug
   
   RoundRobin repartitioning currently does not distribute the input batches evenly over the output channels, so the work is spread unevenly across partitions.
   
   ### To Reproduce
   
   When loading the data into memory in the TPC-H benchmark, the imbalance is visible in the per-partition batch counts of `MemoryExec` (which uses RoundRobin partitioning).
   
   `MemoryExec: partitions=32, partition_sizes=[32, 32, 32, 32, 32, 32, 32, 32, 26, 26, 26, 25, 25, 25, 25, 25, 25, 25, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16], metrics=[]`
   
   The distribution is biased toward the first output partitions/channels.
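
To put a number on the skew, the reported `partition_sizes` can be summarized with a few lines of Rust. This is only an illustrative sketch: `skew_summary` and `REPORTED_SIZES` are hypothetical names, not part of DataFusion, and the values are copied from the `MemoryExec` line above.

```rust
/// Partition sizes copied from the `MemoryExec` line in this report.
const REPORTED_SIZES: [usize; 32] = [
    32, 32, 32, 32, 32, 32, 32, 32, 26, 26, 26, 25, 25, 25, 25, 25,
    25, 25, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
];

/// Hypothetical helper: returns (total batches, mean per partition, max/min ratio).
/// Panics on an empty slice; a zero-sized partition yields an infinite ratio.
fn skew_summary(sizes: &[usize]) -> (usize, f64, f64) {
    let total: usize = sizes.iter().sum();
    let mean = total as f64 / sizes.len() as f64;
    let max = *sizes.iter().max().expect("at least one partition") as f64;
    let min = *sizes.iter().min().expect("at least one partition") as f64;
    (total, mean, max / min)
}
```

For the sizes above this gives 733 batches in total, a mean of about 22.9 batches per partition, and a 2x gap between the fullest partitions (32 batches) and the emptiest (16 batches).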
   
   ### Expected behavior
   
   Batches should be distributed more evenly over output channels.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan commented on issue #6043: Improve RoundRobin `RepartitionExec`

Posted by "Dandandan (via GitHub)" <gi...@apache.org>.
Dandandan commented on issue #6043:
URL: https://github.com/apache/arrow-datafusion/issues/6043#issuecomment-1523249435

   @cristian-ilies-vasile yes, instead of round-robin repartitioning, an improved scheme could be implemented based on the number of buffered batches.
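
One way to read "based on number of buffered batches" is to send each batch to whichever output channel currently has the fewest batches waiting. The sketch below is a minimal illustration under that assumption. `BufferedCounts`, `send`, and `complete` are hypothetical names, and this is not how `RepartitionExec` is implemented.

```rust
/// Hypothetical bookkeeping: for each output channel, the number of batches
/// that have been sent but not yet consumed by the downstream operator.
struct BufferedCounts {
    pending: Vec<usize>,
}

impl BufferedCounts {
    fn new(channels: usize) -> Self {
        assert!(channels > 0, "need at least one output channel");
        Self { pending: vec![0; channels] }
    }

    /// Picks the channel with the fewest buffered batches (ties go to the
    /// lowest index), records one more pending batch on it, and returns it.
    fn send(&mut self) -> usize {
        let (idx, _) = self
            .pending
            .iter()
            .enumerate()
            .min_by_key(|&(_, &count)| count)
            .expect("at least one output channel");
        self.pending[idx] += 1;
        idx
    }

    /// Records that the consumer of `channel` has finished one batch.
    fn complete(&mut self, channel: usize) {
        self.pending[channel] = self.pending[channel].saturating_sub(1);
    }
}
```

Scanning every channel costs O(n) per batch. That is likely acceptable for a few dozen partitions, but a real implementation would also need the counters to be shared safely between the producing and consuming tasks, which this sketch ignores.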




[GitHub] [arrow-datafusion] cristian-ilies-vasile commented on issue #6043: Improve RoundRobin `RepartitionExec`

Posted by "cristian-ilies-vasile (via GitHub)" <gi...@apache.org>.
cristian-ilies-vasile commented on issue #6043:
URL: https://github.com/apache/arrow-datafusion/issues/6043#issuecomment-1516898738

   _Batches should be distributed more evenly over output channels._
   
   This seems to be a load-balancing issue. If you could count the number of batches already distributed to each channel but not yet completed, then the classical "The Power of Two Choices in Randomized Load Balancing" algorithm could be evaluated.
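
As a rough illustration of the power-of-two-choices idea: for each batch, sample two channels at random and send the batch to the one with fewer outstanding batches. The sketch below assumes a per-channel `pending` counter is available. `XorShift64`, `pick_two_choices`, and `simulate` are hypothetical names, and the tiny generator only stands in for a proper random number source to keep the example dependency-free.

```rust
/// Minimal xorshift64 generator, a stand-in for a real RNG.
/// The seed must be non-zero, otherwise every output is zero.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Roughly uniform index in `0..n` (modulo bias is ignored in this sketch).
    fn below(&mut self, n: usize) -> usize {
        (self.next_u64() % n as u64) as usize
    }
}

/// Samples two channels (with replacement) and returns the one with fewer
/// pending batches; on a tie the first sample wins.
fn pick_two_choices(pending: &[usize], rng: &mut XorShift64) -> usize {
    let a = rng.below(pending.len());
    let b = rng.below(pending.len());
    if pending[b] < pending[a] { b } else { a }
}

/// Assigns `batches` batches to `channels` channels, assuming for simplicity
/// that no batch completes while the assignment is in progress.
fn simulate(channels: usize, batches: usize, seed: u64) -> Vec<usize> {
    let mut rng = XorShift64(seed);
    let mut pending = vec![0usize; channels];
    for _ in 0..batches {
        let chosen = pick_two_choices(&pending, &mut rng);
        pending[chosen] += 1;
    }
    pending
}
```

Calling `simulate(32, 733, 42)` mirrors the 32 partitions and 733 batches of the report. Unlike a full scan for the least-loaded channel, each decision looks at only two counters, so the cost per batch does not grow with the number of channels.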
    




[GitHub] [arrow-datafusion] cristian-ilies-vasile commented on issue #6043: Improve RoundRobin `RepartitionExec`

Posted by "cristian-ilies-vasile (via GitHub)" <gi...@apache.org>.
cristian-ilies-vasile commented on issue #6043:
URL: https://github.com/apache/arrow-datafusion/issues/6043#issuecomment-1528708053

   One good article describing this technique can be read here:
   Deterministic Aperture: A distributed, load balancing algorithm
   https://blog.twitter.com/engineering/en_us/topics/infrastructure/2019/daperture-load-balancer

