Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/29 06:56:34 UTC

[GitHub] [arrow-datafusion] yahoNanJing opened a new issue #1701: Ballista Enhancement Overview

yahoNanJing opened a new issue #1701:
URL: https://github.com/apache/arrow-datafusion/issues/1701


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   The current Ballista implementation is more of a proof of concept (POC) that verifies whether DataFusion operators can run in a distributed way. It sets up the overall framework and works well for that purpose. However, it is still a long way from being usable in production for real workloads. This issue raises several aspects we need to consider and enhance to build a more robust distributed execution framework.
   
   In the big data era there are many scenarios. Two common ones are interactive analytical queries and batch processing for ETL. There is no silver bullet: each scenario has its own characteristics and its own needs. Below, I describe some enhancements we can make for each scenario.
   
   For both interactive query and batch processing:
   - [Necessary] Ability to access remote object stores such as HDFS and S3
   - [Necessary] Executor lost handling
   - [Necessary] Configuration management
   - [Nice to have] Schedule stages based on priority
   - [Nice to have] Cancel SQL/CancelJob
   - [Nice to have] Executor blacklist
   
   For interactive query:
   - [Necessary] Push-based task assignment
   - [Necessary] Better data exchange
   - [Necessary] Better result fetching
   
   For batch processing:
   - [Necessary] Task speculative scheduling
   - [Necessary] Shuffle fetch failure handling and retry
   - [Necessary] Stage reattempt


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] gaojun2048 commented on issue #1701: Ballista Enhancement Overview

Posted by GitBox <gi...@apache.org>.
gaojun2048 commented on issue #1701:
URL: https://github.com/apache/arrow-datafusion/issues/1701#issuecomment-1052182149


   Is Ballista meant to be a data computing engine like Spark, or an ad-hoc query engine like Presto / CK / Impala? I believe the roadmap would differ depending on the goal.





[GitHub] [arrow-datafusion] Ted-Jiang commented on issue #1701: Ballista Enhancement Overview

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on issue #1701:
URL: https://github.com/apache/arrow-datafusion/issues/1701#issuecomment-1024861355


   This would be a milestone in Ballista! 👍





[GitHub] [arrow-datafusion] gaojun2048 commented on issue #1701: Ballista Enhancement Overview

Posted by GitBox <gi...@apache.org>.
gaojun2048 commented on issue #1701:
URL: https://github.com/apache/arrow-datafusion/issues/1701#issuecomment-1031014610


   Great, I hope I can contribute to these goals as much as I can.





[GitHub] [arrow-datafusion] yahoNanJing commented on issue #1701: Ballista Enhancement Overview

Posted by GitBox <gi...@apache.org>.
yahoNanJing commented on issue #1701:
URL: https://github.com/apache/arrow-datafusion/issues/1701#issuecomment-1038818256


   > Great, I hope I can contribute to these goals as much as I can.
   
   Hi @gaojun2048, which parts are you interested in? Feel free to pick up some tasks.

