Posted to github@arrow.apache.org by "andygrove (via GitHub)" <gi...@apache.org> on 2023/02/08 14:37:34 UTC

[GitHub] [arrow-ballista] andygrove opened a new issue, #660: Proposal for more efficient disk-based shuffle mechanism

andygrove opened a new issue, #660:
URL: https://github.com/apache/arrow-ballista/issues/660

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   The current shuffle mechanism is too basic and produces too many small shuffle files.
   
   **Describe the solution you'd like**
   See https://docs.google.com/document/d/16SIEoniAWKSFt8XKDLsOfRQ0sU--5E9_OIyE42Zj808/edit?usp=sharing
   
   **Describe alternatives you've considered**
   
   **Additional context**
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-ballista] andygrove commented on issue #660: Proposal for more efficient disk-based shuffle mechanism

Posted by "andygrove (via GitHub)" <gi...@apache.org>.
andygrove commented on issue #660:
URL: https://github.com/apache/arrow-ballista/issues/660#issuecomment-1424244803

   I ran into some challenges with the proposed design, which I have documented in the Google doc (near the end). Feedback welcome.




[GitHub] [arrow-ballista] andygrove commented on issue #660: Proposal for more efficient disk-based shuffle mechanism

Posted by "andygrove (via GitHub)" <gi...@apache.org>.
andygrove commented on issue #660:
URL: https://github.com/apache/arrow-ballista/issues/660#issuecomment-1422707060

   @thinkharderdev @yahoNanJing @mingmwang @avantgardnerio Let me know what you think. If this seems like a good idea, I may have some time later this month to try and implement it. I am also happy for someone else to pick this up.




[GitHub] [arrow-ballista] Dandandan commented on issue #660: Proposal for more efficient disk-based shuffle mechanism

Posted by "Dandandan (via GitHub)" <gi...@apache.org>.
Dandandan commented on issue #660:
URL: https://github.com/apache/arrow-ballista/issues/660#issuecomment-1424306553

   I read through it, and it does indeed sound a bit simpler.
   
   A nice side effect of this optimization, by the way, is that a limit on the shuffle writer also becomes more effective, allowing tasks to terminate sooner.




[GitHub] [arrow-ballista] yahoNanJing commented on issue #660: Proposal for more efficient disk-based shuffle mechanism

Posted by "yahoNanJing (via GitHub)" <gi...@apache.org>.
yahoNanJing commented on issue #660:
URL: https://github.com/apache/arrow-ballista/issues/660#issuecomment-1430792369

   Thanks @andygrove for starting the discussion on this topic.
   
   For the second approach, the benefit of the optimization is also limited when the tasks of a query stage are assigned to different executors, which is common when the RoundRobin task scheduling policy is used for load balancing.
   
   To reduce the number of shuffle write files, I recommend using the sort-based shuffle writer used in Spark (https://issues.apache.org/jira/browse/SPARK-2045). Each original `ShuffleWriterExec` would then produce only 2 output files for its downstream stage rather than N: one data file that concatenates the data of all output partitions, and one index file that records each partition's offset in the data file.
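
The data-file-plus-index layout described above can be sketched roughly as follows. This is a minimal illustration, not Ballista or Spark code: the function names `write_shuffle` and `read_partition` are hypothetical, partitions are plain byte vectors rather than Arrow record batches, and the index is kept in memory instead of being written to a second file.

```rust
use std::io::{self, Read, Seek, SeekFrom, Write};

/// Hypothetical writer: appends every partition's bytes back to back into
/// `data` and returns the index. The index holds `partitions.len() + 1`
/// offsets, so partition `p` occupies bytes `index[p]..index[p + 1]`.
/// The input is assumed to be grouped by partition id already (the "sort" step).
fn write_shuffle<W: Write>(partitions: &[Vec<u8>], data: &mut W) -> io::Result<Vec<u64>> {
    let mut index = Vec::with_capacity(partitions.len() + 1);
    let mut offset = 0u64;
    index.push(offset);
    for bytes in partitions {
        data.write_all(bytes)?;
        offset += bytes.len() as u64;
        index.push(offset);
    }
    Ok(index)
}

/// Hypothetical reader: a downstream task fetches only its own partition by
/// seeking to the start offset recorded in the index.
fn read_partition<R: Read + Seek>(
    data: &mut R,
    index: &[u64],
    partition: usize,
) -> io::Result<Vec<u8>> {
    let start = index[partition];
    let end = index[partition + 1];
    data.seek(SeekFrom::Start(start))?;
    let mut buf = vec![0u8; (end - start) as usize];
    data.read_exact(&mut buf)?;
    Ok(buf)
}
```

With three partitions containing `aaa`, nothing, and `cc`, the index would be `[0, 3, 3, 5]`, and the reader for partition 2 would seek to offset 3 and read 2 bytes. In this layout an empty partition costs one index entry rather than a separate file.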
   
   An intuitive diagram can be found here: https://github.com/blaze-init/blaze/blob/master/dev/doc/architectural_overview.md.
   
   Hi @yjshen, could you share your opinions?
   




[GitHub] [arrow-ballista] Dandandan commented on issue #660: Proposal for more efficient disk-based shuffle mechanism

Posted by "Dandandan (via GitHub)" <gi...@apache.org>.
Dandandan commented on issue #660:
URL: https://github.com/apache/arrow-ballista/issues/660#issuecomment-1422732705

   Sounds like a great idea to me




[GitHub] [arrow-ballista] andygrove commented on issue #660: Proposal for more efficient disk-based shuffle mechanism

Posted by "andygrove (via GitHub)" <gi...@apache.org>.
andygrove commented on issue #660:
URL: https://github.com/apache/arrow-ballista/issues/660#issuecomment-1436060406

   This sounds great @yahoNanJing. I would also like to hear if @yjshen has an opinion on this.

