Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2023/01/09 17:38:12 UTC

[GitHub] [airflow] potiuk commented on issue #26933: Low priority tasks are scheduled before high priority tasks

potiuk commented on issue #26933:
URL: https://github.com/apache/airflow/issues/26933#issuecomment-1375998549

   > We need to somehow make sure that the dependent tasks of a finishing task are put into the waiting queue, before scheduling any new tasks for the now free slot
   
   In case you missed it @christianbrugger - this is precisely what @ashb described as:
   
   > some kind of "look ahead" 
   
   What you described is PRECISELY "look ahead". I think there are many complex problems involved if you try to do such a look ahead. Basically, whatever approach you come up with, you end up with some capacity loss (as @ashb also explained). Another similar case to look at is what processors do with speculative execution, where they start work past a branch before its outcome is known, hoping that the work will turn out to be needed. It is the same here - we do not KNOW whether the task that is currently running will succeed or not, and processing and scheduling further DAG runs will depend on that. And it can get super complex when there are complex DAGs with multiple dependencies.
   
   One of the problems to solve (and this is one of many) - you have to find a way to create those Task Instances in multiple variants (because previous tasks might succeed/fail/skip, and each different state will likely trigger a different set of DagRuns to create) - this is your queue. You also have to discard those DagRuns that are not needed, when the conditions you assumed would be fulfilled are not actually fulfilled. And it can propagate further and further. If your DAG has multiple "layers", this might quickly become really complex. The number of potential DagRun combinations to consider grows fast - even exponentially fast - with each layer, and you need to do proper housekeeping on those already created Task Instances.
   Even if you just focus on the "all green" scenario, you need to at the very least housekeep those Task Instances waiting in the queue.
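   
   To make the growth concrete, here is a minimal sketch of the counting argument only. It is not Airflow code: `OUTCOMES` and `speculative_variants` are hypothetical names, and the model assumes every unfinished task can independently end in one of three states, which is an upper bound (real trigger rules would prune some combinations).

```python
from itertools import product

# Hypothetical terminal states a look-ahead would have to consider per task.
OUTCOMES = ("success", "failed", "skipped")


def speculative_variants(pending_tasks):
    """Return every combination of outcomes for tasks that have not finished.

    Toy model: each pending task is assumed to end independently in one of
    OUTCOMES, so the result has len(OUTCOMES) ** len(pending_tasks) entries.
    """
    return list(product(OUTCOMES, repeat=len(pending_tasks)))


for n in range(1, 5):
    tasks = [f"task_{i}" for i in range(n)]
    # Prints the number of unfinished tasks and the number of variants to track.
    print(n, len(speculative_variants(tasks)))
```

   Under these assumptions the count triples with every additional unfinished task (3, 9, 27, 81 for one to four tasks), and every variant that does not materialize has to be cleaned up again.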
   
   This is ONE of the problems. There are many more where you have to decide about the priorities of processing those tasks - some ways of handling the queues might lead to starvation of other DAGs, and eventually you might slow down the entire system significantly if you want to handle it at scale.
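   
   As an illustration of the starvation risk, here is a toy strict-priority queue. It is a sketch under stated assumptions, not how the Airflow scheduler works: `simulate`, the task names and the priority values are all hypothetical, and one new high-priority task is assumed to arrive on every tick.

```python
import heapq


def simulate(ticks, slots=1):
    """Toy strict-priority scheduler; returns task names in start order.

    Assumptions (hypothetical, for illustration only): one low-priority task
    waits from the beginning, one high-priority task arrives per tick, and
    `slots` tasks can be started per tick.
    """
    queue = []  # min-heap of (-priority, arrival_tick, name)
    heapq.heappush(queue, (-1, 0, "low_priority_task"))
    started = []
    for tick in range(ticks):
        heapq.heappush(queue, (-10, tick, f"high_{tick}"))
        for _ in range(slots):
            if queue:
                _, _, name = heapq.heappop(queue)
                started.append(name)
    return started


started = simulate(ticks=5)
print(started)
print("low_priority_task" in started)
```

   With a single slot, the low-priority task is never started for as long as high-priority work keeps arriving; calling `simulate(ticks=5, slots=2)` lets it through on the first tick. Any look-ahead queue needs a policy for this kind of case.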
   
   There are a number of papers written on similar algorithms and, if anything, I believe a good search of the relevant science and math should be taken into account, because there are many problems that might occur during such "look ahead".
   
   So yeah. I agree with @ashb (and I think this is also what @Taragolis hinted at). At this stage, it is a discussion at most, unless we have a very concrete proposal in the form of an Airflow Improvement Proposal that explains all the details. In that case we can scrutinize it and comment (as with all other AIPs), discuss, vote, and then possibly someone can implement it.
   
   This is neither a "bug" nor a "feature" - it's a property of the current scheduler implementation, and we might improve it in the future, but not without a detailed and well-thought-out proposal that discusses edge cases and considers scale.

