Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/08/26 14:59:59 UTC

[GitHub] [airflow] zikun commented on issue #9914: Add option to prevent most recent run on DAG enable with catchup=False

zikun commented on issue #9914:
URL: https://github.com/apache/airflow/issues/9914#issuecomment-680934697


   Just want to write down my exploration and thoughts which might help if anyone wants to pick it up or discuss it further.
   
   I looked into the scheduling logic, and it turns out that catching up the most recent run was not a deliberate design decision. Rather, it is a "side effect" of the fundamental scheduling logic.
   
   The Airflow scheduler runs like a batch job. Every few seconds, it parses the DAGs and determines the next dag_run (if any) for each DAG. For example, for a DAG scheduled to run hourly (`0 * * * *`), a scheduler cycle starting at 2020-07-22T02:00:09 will schedule the "next" dag_run with `next_execution_time = 2020-07-22T02:00:00` (in other words, `execution_time = 2020-07-22T01:00:00`). Note that although I'm using the word **"next"**, the scheduler is really looking back from the current time to determine and schedule the "next" dag_run. If a DAG is paused at the beginning and is enabled at 2020-07-22T02:30:00, the next scheduler cycle will look back and still find the "next" dag_run to be `next_execution_time = 2020-07-22T02:00:00`. So **it appears to be a catchup, but it is really just scheduling with a longer delay**.
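
To make the look-back concrete, here is a minimal sketch in plain Python. It is not the actual scheduler code: the function name `look_back_hourly` is made up for illustration, and it hard-codes the hourly `0 * * * *` schedule from the example instead of parsing a cron expression.

```python
from datetime import datetime, timedelta
from typing import Tuple

HOUR = timedelta(hours=1)


def look_back_hourly(now: datetime) -> Tuple[datetime, datetime]:
    """Hypothetical helper: look back from `now` on an hourly schedule.

    Returns (execution_time, next_execution_time) for the most recent
    interval that has already closed.
    """
    # The latest schedule tick at or before `now` is the top of the hour.
    next_execution_time = now.replace(minute=0, second=0, microsecond=0)
    # The run covers the interval that ended at that tick.
    execution_time = next_execution_time - HOUR
    return execution_time, next_execution_time


# A cycle running right after the tick...
on_time = look_back_hourly(datetime(2020, 7, 22, 2, 0, 9))
# ...and a cycle running after the DAG is enabled half an hour later.
late = look_back_hourly(datetime(2020, 7, 22, 2, 30, 0))

# Both cycles pick the same dag_run; the second is simply scheduled later.
print(on_time == late)
```

Both calls return the same pair of timestamps, which is why the run created after enabling the DAG looks like a catchup even though nothing distinguishes it from an ordinary, slightly delayed run.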
   
   Therefore, if we want to prevent the most recent catchup run, we might have to add some more complex logic and modify the DAG model, for example by adding a timestamp that records when a DAG was enabled so that the scheduler can skip the most recent run. I'm not sure it's worth it for this feature.
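
A rough sketch of what that check could look like, again in plain Python. Everything here is an assumption: `enabled_at` is a hypothetical timestamp that does not exist on the DAG model as far as this comment is concerned, and `should_schedule` is an illustrative name, not an Airflow function.

```python
from datetime import datetime
from typing import Optional


def should_schedule(next_execution_time: datetime,
                    enabled_at: Optional[datetime]) -> bool:
    """Hypothetical guard: skip runs that became due before the DAG was enabled."""
    if enabled_at is None:
        # No recorded enable time, so keep today's behaviour.
        return True
    # Only schedule runs whose due time falls at or after the enable time.
    return next_execution_time >= enabled_at


enabled_at = datetime(2020, 7, 22, 2, 30, 0)

# The run that was due at 02:00 predates the enable time, so it is skipped.
skipped = should_schedule(datetime(2020, 7, 22, 2, 0, 0), enabled_at)
# The run due at 03:00 is the first one scheduled after enabling.
kept = should_schedule(datetime(2020, 7, 22, 3, 0, 0), enabled_at)

print(skipped, kept)
```

The comparison itself is trivial; the cost is in persisting and maintaining the extra timestamp on the DAG model, which is the part that may not be worth it.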

