Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2019/08/28 04:21:03 UTC

[GitHub] [airflow] yuqian90 commented on a change in pull request #5458: [AIRFLOW-4495] allow externally triggered dags to run for future 'Exe…

yuqian90 commented on a change in pull request #5458: [AIRFLOW-4495] allow externally triggered dags to run for future 'Exe…
URL: https://github.com/apache/airflow/pull/5458#discussion_r318386767
 
 

 ##########
 File path: airflow/jobs/scheduler_job.py
 ##########
 @@ -684,13 +684,6 @@ def _process_task_instances(self, dag, task_instances_list, session=None):
         active_dag_runs = []
         for run in dag_runs:
             self.log.info("Examining DAG run %s", run)
-            # don't consider runs that are executed in the future
-            if run.execution_date > timezone.utcnow():
-                self.log.error(
-                    "Execution date is in future: %s",
-                    run.execution_date
-                )
-                continue
 
 Review comment:
   @XD-DENG I actually disagree with the statement "If your DagRun's execution_date is in the future, for sure it should not be considered for execution".
   
   We should think about what execution_date is. It is definitely NOT the datetime at which the tasks in the DAG start running, so there is no reason to make the scheduler ignore tasks with an execution_date in the future. In the current implementation, if execution_date is T and DAG.schedule_interval is set to a cron expression, the first task in the DAG runs at T + one schedule_interval, not at T. See https://airflow.apache.org/concepts.html
   "The time Airflow triggers a DAG should equal execution_date plus one schedule interval if cron expression is used: The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period"
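   
   To make the quoted rule concrete, here is a minimal sketch using only the standard library. It assumes a fixed one-day interval in place of a cron expression, and `earliest_scheduled_start` is a hypothetical helper, not an Airflow function.

```python
from datetime import datetime, timedelta, timezone


def earliest_scheduled_start(execution_date, schedule_interval):
    """Hypothetical helper: a scheduled run starts at the END of its period,
    i.e. one schedule interval after its execution_date."""
    return execution_date + schedule_interval


execution_date = datetime(2019, 8, 1, tzinfo=timezone.utc)
# Assumption: a daily schedule, modelled as a fixed timedelta.
start = earliest_scheduled_start(execution_date, timedelta(days=1))
print(start.isoformat())  # 2019-08-02T00:00:00+00:00
```

   So for a daily schedule, the run labelled 20190801 does not start until 20190802 00:00 UTC; execution_date labels the period, not the start time.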
   
   If we set DAG.schedule_interval to None and use an external trigger to trigger the DAG, we should naturally expect the scheduler to start considering the tasks for execution once the DAG is triggered, even if the execution_date is in the future. In the current implementation, this is not the case. If execution_date is 20190801, the earliest time the scheduler can consider the tasks in the DAG is 20190801 00:00 UTC, because of the check at this line of code. So if someone uses an external trigger to trigger the DAG at 20190731 23:00 UTC with execution_date 20190801, the tasks will not be considered for execution until 20190801 00:00 UTC.
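   
   The effect of the current check can be sketched in isolation. `skipped_by_future_check` is a hypothetical stand-in for the `run.execution_date > timezone.utcnow()` comparison in the diff, with the current time passed in explicitly so the example is reproducible.

```python
from datetime import datetime, timezone


def skipped_by_future_check(execution_date, now):
    """Hypothetical stand-in for the check in the diff: a DAG run is skipped
    while its execution_date is still later than the current time."""
    return execution_date > now


execution_date = datetime(2019, 8, 1, tzinfo=timezone.utc)
triggered_at = datetime(2019, 7, 31, 23, 0, tzinfo=timezone.utc)

# Triggered one hour before the execution_date: the run is skipped.
print(skipped_by_future_check(execution_date, triggered_at))  # True
# Once the clock reaches the execution_date, the run is considered.
print(skipped_by_future_check(execution_date, execution_date))  # False
```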
   
   This is very problematic if you are working with multiple timezones. For example, if you are in Asia/Tokyo and you want the jinja templates to evaluate ds_nodash to 20190801, you need to make execution_date 20190801. But if you do that, the DAG can never start running before 20190801 00:00 UTC, which is 20190801 09:00 Tokyo time. So it is not possible to run something at 20190801 8am Tokyo time with execution_date 20190801, even if you trigger the DAG externally before 9am.
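   
   The timezone arithmetic can be checked with a short sketch. It assumes a fixed UTC+9 offset for Asia/Tokyo (Japan does not observe daylight saving time), which avoids needing a timezone database.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Asia/Tokyo modelled as a fixed UTC+9 offset.
TOKYO = timezone(timedelta(hours=9))

execution_date = datetime(2019, 8, 1, tzinfo=timezone.utc)

# Earliest moment the current check lets the run start, in Tokyo time.
earliest_tokyo = execution_date.astimezone(TOKYO)
print(earliest_tokyo.isoformat())  # 2019-08-01T09:00:00+09:00

# The desired start, 8am Tokyo time on the same date, expressed in UTC.
desired_utc = datetime(2019, 8, 1, 8, 0, tzinfo=TOKYO).astimezone(timezone.utc)
print(desired_utc.isoformat())  # 2019-07-31T23:00:00+00:00

# The desired start is earlier than the execution_date, so the check blocks it.
print(desired_utc < execution_date)  # True
```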
   
   Removing this check when DAG.schedule_interval is None seems to be a move in the right direction, although this pull request needs a lot more testing to make sure it does the right thing. People also need to be given some warning before this change goes out.
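   
   One way to read "removing this check only when there is no schedule" is sketched below. This is not the code in the pull request, which deletes the check outright; `should_skip_run` and its parameters are hypothetical names used for illustration.

```python
from datetime import datetime, timedelta, timezone


def should_skip_run(execution_date, now, schedule_interval):
    """Hypothetical variant of the check: only scheduled DAGs are held back
    until their execution_date; DAGs without a schedule are never skipped."""
    if schedule_interval is None:
        # Externally triggered DAG: consider it as soon as it is triggered.
        return False
    return execution_date > now


execution_date = datetime(2019, 8, 1, tzinfo=timezone.utc)
now = datetime(2019, 7, 31, 23, 0, tzinfo=timezone.utc)

print(should_skip_run(execution_date, now, None))  # False
print(should_skip_run(execution_date, now, timedelta(days=1)))  # True
```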

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services