Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/13 10:15:33 UTC

[GitHub] [arrow-ballista] thinkharderdev commented on pull request #59: Support for multi-scheduler deployments

thinkharderdev commented on PR #59:
URL: https://github.com/apache/arrow-ballista/pull/59#issuecomment-1153735022

   > Thanks @thinkharderdev for the great work. The code is well implemented and well documented, and the task scheduling algorithm is quite sophisticated.
   > 
   > I only have a few small concerns:
   > 
   > 1. For the stage-based task scheduling, suppose a scenario of 3 stages, stage3 ---> stage2 ---> stage1, with 1K tasks for stage1, 1 task for stage2, and 1K tasks for stage3. When stage2 finishes, it's better to schedule the tasks for stage3 all at once. However, with the algorithm based on the _ExecutorReservation_, it seems this kind of all-at-once scheduling only happens when the job is submitted. It may be better to keep the previous _StageFinished_ event.
   > 2. For every state change for a job, such as a task status update, this _ExecutionGraph_-based implementation needs to fetch and decode the _ExecutionGraph_ from the config backend. Will it be too heavy, especially when there are thousands or millions of tasks in a job? The previous design of keeping the tasks in memory aimed to reduce exactly this kind of cost. Therefore, I prefer not to persist the task status info into the config backend.
   > 3. For the job scheduling policy, this implementation makes it possible for one job to be scheduled by multiple schedulers. However, I don't think that's necessary. It's better to employ an active-standby policy, and to make recovery stage-level rather than task-level if a job's active scheduler terminates. Then we can avoid the _ExecutionGraph_ decoding cost for every task update.
   
   Thanks @yahoNanJing for the review.
   
   1. This is a good point. I wanted to avoid locking the executor state as much as possible, but I see that the scenario you describe is a degenerate case for purely reservation-based scheduling: completing the single stage2 task frees only one slot, while stage3 suddenly has 1K runnable tasks. Re-offering reservations on stage completion would handle this (see the first sketch after this list).
   2. I am concerned about this as well. For simplicity I put everything into a single data structure, but conceptually there is no reason we have to do it that way. We can have the `ExecutionGraph` store only the DAG structure and keep the plan information in a separate data structure. Since the plan is immutable, it can be cached in memory effectively. And I think you're right that the full task status doesn't need to be stored at all; we only need to know whether each task is pending or not (see the second sketch after this list).
   3. On this point I disagree. High availability is one goal of this work, but another is horizontal scalability. The work involved in physical planning is sometimes non-trivial; for instance, scanning a large Hive table can involve reading parquet metadata from many files, and it will be useful to have multiple active schedulers to scale that work.
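
   To make point 1 concrete, here is a minimal sketch of what re-offering reservations on stage completion could look like. All the names here (`SchedulerEvent`, `ExecutorManager`, `reserve_slots`, etc.) are hypothetical illustrations for this discussion, not the actual API in this PR:

   ```rust
   // Hypothetical sketch: these types are illustrative, not Ballista's API.

   pub struct ExecutorReservation {
       pub executor_id: String,
   }

   pub enum SchedulerEvent {
       JobSubmitted { job_id: String },
       // Keeping a stage-completion event lets us size the reservation
       // request to the newly runnable stage, not to one-for-one slot swaps.
       StageCompleted { job_id: String, stage_id: usize },
   }

   pub struct ExecutorManager;

   impl ExecutorManager {
       /// Claim up to `n` free task slots across all executors.
       pub fn reserve_slots(&self, _n: usize) -> Vec<ExecutorReservation> {
           Vec::new() // elided: lock executor state and claim slots
       }
   }

   fn runnable_task_count(_job_id: &str) -> usize {
       0 // elided: count tasks whose input stages are all complete
   }

   fn assign_tasks(_job_id: &str, _reservations: Vec<ExecutorReservation>) {
       // elided: launch tasks on the reserved slots
   }

   /// On both submission and stage completion, reserve slots sized to the
   /// runnable task count. In the stage3 ---> stage2 ---> stage1 example,
   /// stage2 finishing frees only one slot, but stage3 now has 1K runnable
   /// tasks, so we ask for 1K reservations instead of recycling one.
   pub fn handle_event(event: SchedulerEvent, executors: &ExecutorManager) {
       match event {
           SchedulerEvent::JobSubmitted { job_id }
           | SchedulerEvent::StageCompleted { job_id, .. } => {
               let runnable = runnable_task_count(&job_id);
               let reservations = executors.reserve_slots(runnable);
               assign_tasks(&job_id, reservations);
           }
       }
   }
   ```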
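
   And for point 2, a rough sketch of the split I have in mind: immutable plans cached in memory, with only a small pending/completed record persisted per job. Again, the type names are hypothetical, not what is currently in the PR:

   ```rust
   use std::collections::HashMap;
   use std::sync::Arc;

   // Hypothetical sketch of splitting job state; names are illustrative.

   /// Immutable per-stage plan. Written once at job submission, so every
   /// scheduler can cache the decoded plan in memory indefinitely.
   pub struct StagePlan {
       pub stage_id: usize,
       pub plan_bytes: Arc<Vec<u8>>, // serialized physical plan
   }

   /// The only mutable, persisted per-task state: a pending/completed
   /// flag, rather than the full task status.
   #[derive(Clone, Copy, PartialEq)]
   pub enum TaskState {
       Pending,
       Completed,
   }

   /// Small record persisted to the config backend on each status update.
   pub struct GraphState {
       pub task_states: HashMap<usize, Vec<TaskState>>, // stage_id -> tasks
   }

   pub struct SchedulerCache {
       /// Decoded plans by stage id; a miss costs one backend read, after
       /// which the plan never needs to be fetched or decoded again.
       plans: HashMap<usize, Arc<StagePlan>>,
   }

   impl SchedulerCache {
       /// A task status update touches only the small `GraphState` record,
       /// avoiding a fetch-and-decode of the whole ExecutionGraph (plans
       /// included) on every update for a job with thousands of tasks.
       pub fn complete_task(&self, state: &mut GraphState, stage_id: usize, partition: usize) {
           if let Some(tasks) = state.task_states.get_mut(&stage_id) {
               if let Some(t) = tasks.get_mut(partition) {
                   *t = TaskState::Completed;
               }
           }
           // elided: persist just `state` back to the config backend
       }
   }
   ```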

