Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/17 10:23:03 UTC

[GitHub] [arrow-datafusion] mingmwang opened a new issue #1848: settings in ExecuteQueryParams is omitted by the Ballista's scheduler.execute_query(), cause wrong partition count

mingmwang opened a new issue #1848:
URL: https://github.com/apache/arrow-datafusion/issues/1848


   **Describe the bug**
   
   The issue is caused by the changes in PR [1677](https://github.com/apache/arrow-datafusion/pull/1677),
   which always use the ExecutionContext from the SchedulerServer and so ignore the settings in ExecuteQueryParams.
   
   Before the change, running TPC-H benchmark Q1 on Ballista logs:
   
   [2022-02-16T08:47:59Z INFO  ballista_scheduler] Adding stage 1 with 1 pending tasks
   [2022-02-16T08:47:59Z INFO  ballista_scheduler] Adding stage 2 with 2 pending tasks
   [2022-02-16T08:47:59Z INFO  ballista_scheduler] Adding stage 3 with 1 pending tasks
   
   After the change:
   
   [2022-02-16T08:44:57Z INFO  ballista_scheduler] Adding stage 1 with 1 pending tasks
   [2022-02-16T08:44:57Z INFO  ballista_scheduler] Adding stage 2 with 8 pending tasks
   [2022-02-16T08:44:57Z INFO  ballista_scheduler] Adding stage 3 with 1 pending tasks
   
   
   **Expected behavior**
   
   SchedulerServer should honor the configuration settings from the ExecuteQueryParams.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] thinkharderdev commented on issue #1848: settings in ExecuteQueryParams is omitted by the Ballista's scheduler.execute_query(), cause wrong partition count

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on issue #1848:
URL: https://github.com/apache/arrow-datafusion/issues/1848#issuecomment-1042855277


   Will do. I think there are a couple of different ways we can approach this:
   
   1. Have the client specify a namespace in the request and use an `ExecutionContext`-per-namespace on the scheduler. We could then dynamically create new contexts whenever a new namespace comes in.
   2. Have the scheduler dynamically set target partitions based on executor statistics (e.g. number of available task slots). This would, I think, require a way to set the target partitions explicitly when creating a SQL plan. So maybe add a new method to `ExecutionContext` like:
   
   `pub async fn sql(&mut self, sql: &str, target_partitions: usize) -> Result<Arc<dyn DataFrame>>`
   
   Or both. Option 1 may be necessary anyway to support multi-tenancy, but within a single namespace we may still want to allow specifying shuffle settings on a per-query basis.
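
To make the two options concrete, here is a minimal sketch that uses only the Rust standard library. `NamespaceContext`, `ContextRegistry`, `context_for`, and `partitions_for_query` are hypothetical names invented for illustration. They are not DataFusion or Ballista APIs, and the sketch does not plan any SQL. It only shows a context created on first use per namespace (option 1) and a per-query value that takes precedence over the namespace default (option 2).

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a scheduler-side execution context.
#[derive(Debug, Clone, PartialEq)]
struct NamespaceContext {
    namespace: String,
    /// Default used when a query does not ask for a specific value.
    target_partitions: usize,
}

/// Hypothetical registry: one context per namespace, created on first use.
struct ContextRegistry {
    default_target_partitions: usize,
    contexts: HashMap<String, NamespaceContext>,
}

impl ContextRegistry {
    fn new(default_target_partitions: usize) -> Self {
        Self {
            default_target_partitions,
            contexts: HashMap::new(),
        }
    }

    /// Option 1: look up the context for `namespace`, creating it if needed.
    fn context_for(&mut self, namespace: &str) -> &mut NamespaceContext {
        let default = self.default_target_partitions;
        self.contexts
            .entry(namespace.to_string())
            .or_insert_with(|| NamespaceContext {
                namespace: namespace.to_string(),
                target_partitions: default,
            })
    }

    /// Option 2: a per-query request wins over the namespace default.
    fn partitions_for_query(&mut self, namespace: &str, requested: Option<usize>) -> usize {
        let ctx = self.context_for(namespace);
        requested.unwrap_or(ctx.target_partitions)
    }
}
```

With a registry built by `ContextRegistry::new(8)`, a query in a new namespace that requests nothing gets 8 partitions, while the same namespace with `Some(2)` gets 2. Combining both options this way keeps multi-tenant isolation and per-query shuffle settings independent of each other.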





[GitHub] [arrow-datafusion] mingmwang closed issue #1848: settings in ExecuteQueryParams is omitted by the Ballista's scheduler.execute_query(), cause wrong partition count

Posted by GitBox <gi...@apache.org>.
mingmwang closed issue #1848:
URL: https://github.com/apache/arrow-datafusion/issues/1848


   





[GitHub] [arrow-datafusion] mingmwang commented on issue #1848: settings in ExecuteQueryParams is omitted by the Ballista's scheduler.execute_query(), cause wrong partition count

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #1848:
URL: https://github.com/apache/arrow-datafusion/issues/1848#issuecomment-1042798316


   @thinkharderdev Please take a look.





[GitHub] [arrow-datafusion] mingmwang commented on issue #1848: settings in ExecuteQueryParams is omitted by the Ballista's scheduler.execute_query(), cause wrong partition count

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #1848:
URL: https://github.com/apache/arrow-datafusion/issues/1848#issuecomment-1042814275


   I think we need to introduce session-level state to hold session-specific configuration, instead of a globally shared ExecutionContext/ExecutionContextState. With a shared Ballista scheduler, different users might submit SQL queries with different SQL configurations or shuffle settings.
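
A rough sketch of what such session-level state could look like, assuming the request carries its settings as string key/value pairs. `SchedulerDefaults`, `SessionState`, and the keys `"target_partitions"` and `"batch_size"` are hypothetical names chosen for this example, not the actual Ballista types or configuration keys.

```rust
/// Hypothetical scheduler-wide defaults, shared by every session.
#[derive(Debug, Clone, Copy)]
struct SchedulerDefaults {
    target_partitions: usize,
    batch_size: usize,
}

/// Hypothetical per-session state built from the settings of one request.
#[derive(Debug, Clone, PartialEq)]
struct SessionState {
    target_partitions: usize,
    batch_size: usize,
}

impl SessionState {
    /// Start from the shared defaults and apply any settings the client sent.
    /// Unknown keys and unparsable or zero values are ignored in this sketch.
    fn from_settings(defaults: SchedulerDefaults, settings: &[(String, String)]) -> Self {
        let mut state = SessionState {
            target_partitions: defaults.target_partitions,
            batch_size: defaults.batch_size,
        };
        for (key, value) in settings {
            match (key.as_str(), value.parse::<usize>()) {
                ("target_partitions", Ok(n)) if n > 0 => state.target_partitions = n,
                ("batch_size", Ok(n)) if n > 0 => state.batch_size = n,
                _ => {}
            }
        }
        state
    }
}
```

Because each `SessionState` is a separate value derived from the defaults, one user asking for 2 target partitions does not change what another user on the same scheduler sees.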





[GitHub] [arrow-datafusion] mingmwang commented on issue #1848: settings in ExecuteQueryParams is omitted by the Ballista's scheduler.execute_query(), cause wrong partition count

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #1848:
URL: https://github.com/apache/arrow-datafusion/issues/1848#issuecomment-1043794840


   > Will do. I think there are a couple of different ways we can approach this:
   > 
   > 1. Have the client specify a namespace in the request and use a `ExecutionContext`-per-namespace on the scheduler. We could then dynamically create new contexts whenever a new namespace comes in.
   > 2. Have the scheduler dynamically set target partitions based on executor statistics (e.g. number of available task slots). This would I think require a way to set the target partitions explicitly when creating a sql plan. So maybe add a new method to `ExecutionContext` like
   > 
   > `pub async fn sql(&mut self, sql: &str, target_partitions: usize) -> Result<Arc<dyn DataFrame>>`
   > 
   > Or both. 1 may be necessary anyway to support multi-tenancy but we may still, within a single namespace, want to allow specifying shuffle settings on a per-query basis.
   
   I would prefer to let users choose the target partition count at this stage. The target partition count should not change too dynamically; otherwise the distributed physical plan will not be stable at runtime, and the changes could introduce additional shuffle exchanges. In the future we might add adaptive methods that adjust the target partition size based on input/output data volume.
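
The stability concern can be illustrated with a deliberately simplified model. The `Partitioning` enum and `needs_shuffle` function below are hypothetical and are not the DataFusion planner's actual types or logic. The column name `l_returnflag` is only an illustrative hash key. The sketch assumes that hash-partitioned data can be reused only when both the keys and the partition count match.

```rust
/// Hypothetical, simplified description of how a stage's output is laid out.
#[derive(Debug, Clone, PartialEq)]
enum Partitioning {
    /// Rows are hashed on `keys` into `count` partitions.
    Hash { keys: Vec<String>, count: usize },
    /// No useful distribution is known or required.
    Unknown { count: usize },
}

/// Returns true when data laid out as `existing` must be reshuffled
/// before an operator that requires `required` can run.
fn needs_shuffle(existing: &Partitioning, required: &Partitioning) -> bool {
    match (existing, required) {
        // Hash layouts are reusable only if keys and count are identical.
        (Partitioning::Hash { .. }, Partitioning::Hash { .. }) => existing != required,
        // The downstream operator has no distribution requirement.
        (_, Partitioning::Unknown { .. }) => false,
        // Unknown layout feeding a hash requirement always needs a shuffle.
        _ => true,
    }
}
```

Under this model, an upstream stage hashed into 2 partitions satisfies a downstream requirement of 2 partitions on the same keys, but if the target changes to 8 between stages, an extra exchange is needed.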
   





[GitHub] [arrow-datafusion] mingmwang commented on issue #1848: settings in ExecuteQueryParams is omitted by the Ballista's scheduler.execute_query(), cause wrong partition count

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #1848:
URL: https://github.com/apache/arrow-datafusion/issues/1848#issuecomment-1085381045


   The issue is fixed.





[GitHub] [arrow-datafusion] mingmwang commented on issue #1848: settings in ExecuteQueryParams is omitted by the Ballista's scheduler.execute_query(), cause wrong partition count

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #1848:
URL: https://github.com/apache/arrow-datafusion/issues/1848#issuecomment-1043804661


   Besides the target partition count, I think there are a couple of other configuration options that users could specify and change dynamically, for example batch_size, parquet_pruning, and repartition_windows.
   
   I searched the open issues and found a couple of configuration-related issues that are still open:
   
   [138](https://github.com/apache/arrow-datafusion/issues/138)
   [682](https://github.com/apache/arrow-datafusion/issues/682)
   
   I think it is time to resolve those and come up with a more extensible configuration design.
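
One possible shape for a more extensible design is a string-keyed map with typed, defaulted lookups, so new options do not require new struct fields. This is only a sketch: `QueryConfig`, `set`, and `get_or` are hypothetical names, and the keys simply reuse the option names mentioned above.

```rust
use std::collections::HashMap;
use std::str::FromStr;

/// Hypothetical key-value configuration with typed, defaulted lookups.
#[derive(Debug, Default, Clone)]
struct QueryConfig {
    values: HashMap<String, String>,
}

impl QueryConfig {
    /// Builder-style setter; a later `set` for the same key overwrites the earlier one.
    fn set(mut self, key: &str, value: &str) -> Self {
        self.values.insert(key.to_string(), value.to_string());
        self
    }

    /// Parse the value stored under `key`, falling back to `default`
    /// when the key is absent or the value does not parse as `T`.
    fn get_or<T: FromStr>(&self, key: &str, default: T) -> T {
        self.values
            .get(key)
            .and_then(|raw| raw.parse::<T>().ok())
            .unwrap_or(default)
    }
}
```

For example, `QueryConfig::default().set("batch_size", "4096")` returns 4096 from `get_or("batch_size", 8192usize)`, and an unset key such as `"repartition_windows"` falls back to whatever default the caller passes. Silently falling back on a parse failure keeps the sketch short; a real design would probably want to report invalid values.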
   





[GitHub] [arrow-datafusion] thinkharderdev commented on issue #1848: settings in ExecuteQueryParams is omitted by the Ballista's scheduler.execute_query(), cause wrong partition count

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on issue #1848:
URL: https://github.com/apache/arrow-datafusion/issues/1848#issuecomment-1042855813


   Also, good catch! Apologies for overlooking this. 

