You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/10 11:24:12 UTC

[GitHub] [arrow-datafusion] mingmwang opened a new issue #1805: Streaming style push-based shuffle and All-at-once stage scheduling in Ballista

mingmwang opened a new issue #1805:
URL: https://github.com/apache/arrow-datafusion/issues/1805


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   A new feature enhancement
   
   **Describe the solution you'd like**
   
   Current Ballista’s shuffle implementation is very similar to Spark’s early version. It’s the hash-based shuffle solution where shuffle data is materialized to disks, each map task produces separate files for each reduce task. For a shuffle operation involves M map tasks and N reduce task, it will generate M*N files. Too many tiny files will cause performance, memory and scalability issues. Later Spark version introduced the sort-based shuffle solution and became the default shuffle implementation. the sort-based shuffle will not generate M*N files, each map task sort the records by the partition id + key and generate a pair of files, all records are consolidate in one data file and an index file is created to manage the data range metadata for different partitions. In Spark 2.0, the hash-based shuffle code was removed. Spark also introduced external shuffle services to serve materialized intermediate shuffle data in order to achieve better fault-tolerance and performance iso
 lation. In the recent Spark 3.2 release, it introduces a push based shuffle solution (SPARK-30602) to further improve the shuffle stability and IO performance. With spark’s push-based shuffle, shuffle is performed at the end of the map tasks and shuffle blocks are pre-merged and pushed to selected reducer nodes or upload to spark external shuffle servers.
   
   Other distributed compute engines like Flink and Presto also support the shuffle operation. But they didn’t materialize the shuffle data to disks, instead, shuffle data is streamingly materialized into an in-memory buffer, the reduce tasks poll the shuffle data from map tasks’ in-memory buffer to minimize the end-to-end latency.
   
   Here, we propose a new streaming style push-based shuffle solution for Ballista. Where shuffle is performed at the end of map tasks. Instead of materializing the intermediate shuffle data to disks and generate M*N files, shuffle data is directly pushed to the reduce tasks via Arrow-Flight gRpc call to achieve very low latency. This is important for low latency queries. The corresponding Stage scheduling will be enhanced to support the All-at-Once scheduling. With all-at-once scheduling, all the stages of a SQL/Job will be scheduled at almost the same time. The distributed DAG of the query is fixed at the beginning,  so that the map tasks can streamingly push the shuffle data to downstream reduce tasks.
   
   I will draft a detailed design doc to cover the proposed API changes later.
   
   **Describe alternatives you've considered**
   A clear and concise description of any alternative solutions or features you've considered.
   
   **Additional context**
   Add any other context or screenshots about the feature request here.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on issue #1805: [Ballista] Streaming style push-based shuffle and All-at-once stage scheduling in Ballista

Posted by GitBox <gi...@apache.org>.
xudong963 commented on issue #1805:
URL: https://github.com/apache/arrow-datafusion/issues/1805#issuecomment-1035880945


   Hi, @mingmwang , nice work!
   
   You can open an RFC ticket if you are willing to, then we can discuss based on github. After we get the final version, we can merge it and save the RFC in our codebase. FYI, https://github.com/apache/arrow-datafusion/tree/master/docs/source/specification/rfcs


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on issue #1805: [Ballista] Streaming style push-based shuffle and All-at-once stage scheduling in Ballista

Posted by GitBox <gi...@apache.org>.
houqp commented on issue #1805:
URL: https://github.com/apache/arrow-datafusion/issues/1805#issuecomment-1046552839


   The design looks good to me, thanks for writing it up @mingmwang , left a minor question in the doc.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on issue #1805: [Ballista] Streaming style push-based shuffle and All-at-once stage scheduling in Ballista

Posted by GitBox <gi...@apache.org>.
xudong963 commented on issue #1805:
URL: https://github.com/apache/arrow-datafusion/issues/1805#issuecomment-1035900009


   > Can we discuss in the google doc or in this thread directly ? Everyone can comment on the google doc.
   
   Both are ok, just a suggestion. 
   
   > I will open a PR to cover all the related code changes so that everyone who interest can take a look and give me advice.
   
   👍


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] gaojun2048 commented on issue #1805: [Ballista] Streaming style push-based shuffle and All-at-once stage scheduling in Ballista

Posted by GitBox <gi...@apache.org>.
gaojun2048 commented on issue #1805:
URL: https://github.com/apache/arrow-datafusion/issues/1805#issuecomment-1052143910


   It's great!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] mingmwang commented on issue #1805: [Ballista] Streaming style push-based shuffle and All-at-once stage scheduling in Ballista

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #1805:
URL: https://github.com/apache/arrow-datafusion/issues/1805#issuecomment-1035897465


   > Hi, @mingmwang , nice work!
   > 
   > You can open an RFC ticket if you are willing to, then we can discuss based on github. After we get the final version, we can merge it and save the RFC in our codebase. FYI, https://github.com/apache/arrow-datafusion/tree/master/docs/source/specification/rfcs
   
   Can we discuss in the google doc or in this thread directly ? Everyone can comment on the google doc.
   I will open a PR to cover all the related code changes so that everyone who interest can take a look and give me advice.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] thinkharderdev commented on issue #1805: [Ballista] Streaming style push-based shuffle and All-at-once stage scheduling in Ballista

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on issue #1805:
URL: https://github.com/apache/arrow-datafusion/issues/1805#issuecomment-1044339906


   This sounds great! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] mingmwang commented on issue #1805: [Ballista] Streaming style push-based shuffle and All-at-once stage scheduling in Ballista

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #1805:
URL: https://github.com/apache/arrow-datafusion/issues/1805#issuecomment-1035873416


   Add a design doc for further discussion.
   
   https://docs.google.com/document/d/17J9H6gGBVktmRAFYNQu-v52QUUPlghRnVLIZC3mFYFY/edit?usp=sharing
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org