Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/12/22 14:12:56 UTC

[GitHub] [airflow] andreychernih opened a new issue #20461: Airflow is trying to schedule tasks prior to DAG's start_date

andreychernih opened a new issue #20461:
URL: https://github.com/apache/airflow/issues/20461


   ### Apache Airflow version
   
   2.2.2
   
   ### What happened
   
   I have a DAG whose start_date was initially 09/01/2021 but was later changed to 11/01/2021. The DAG has some runs from before 11/01/2021 that never got a chance to finish. I can now see that the scheduler is still trying to schedule those runs, but none of their tasks start, presumably because of the start_date check. This maxes out the active runs and blocks every other date within the DAG's range from being scheduled.
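
A toy model of the blocking described above, as a rough sketch only. It assumes the dates are month/day/year and that every unfinished run occupies one `max_active_runs` slot whether or not its tasks can execute. `Run` and `startable_runs` are hypothetical names for illustration, not Airflow APIs.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Run:
    # Hypothetical stand-in for a DAG run; not Airflow's DagRun model.
    logical_date: date
    state: str = "running"


def startable_runs(active_runs, pending_dates, start_date, max_active_runs):
    """Toy model: return the pending dates that could start right now.

    Active runs occupy slots even if their tasks never execute, so runs
    dated before start_date still count against the limit.
    """
    free_slots = max(max_active_runs - len(active_runs), 0)
    eligible = [d for d in pending_dates if d >= start_date]
    return eligible[:free_slots]


start = date(2021, 11, 1)
# Three unfinished runs left over from before the start_date change.
stale = [Run(date(2021, 9, day)) for day in (1, 2, 3)]
pending = [date(2021, 11, 1), date(2021, 11, 2)]

# With the stale runs holding every slot, nothing in range can start.
blocked = startable_runs(stale, pending, start, max_active_runs=3)
# Once the stale runs are gone, the in-range dates become startable.
unblocked = startable_runs([], pending, start, max_active_runs=3)
```

In this model the stale runs never finish, so the slots are never released, which matches the behavior reported here.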
   
   <img width="1335" alt="Screen Shot 2021-12-22 at 7 56 22 AM" src="https://user-images.githubusercontent.com/131281/147105749-cc32bfe4-33b1-46fc-8cca-f5e31c92b0e0.png">
   <img width="992" alt="Screen Shot 2021-12-22 at 7 56 29 AM" src="https://user-images.githubusercontent.com/131281/147105764-857d0f85-2877-4ae5-b568-7cbf21c6ca48.png">
   
   
   ### What you expected to happen
   
   The scheduler should not try to schedule runs that are prior to the DAG's start_date.
   
   ### How to reproduce
   
   _No response_
   
   ### Operating System
   
   Airflow Docker
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Other Docker-based deployment
   
   ### Deployment details
   
   _No response_
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] andreychernih edited a comment on issue #20461: Airflow is trying to schedule tasks prior to DAG's start_date

Posted by GitBox <gi...@apache.org>.
andreychernih edited a comment on issue #20461:
URL: https://github.com/apache/airflow/issues/20461#issuecomment-1001583189


   This is fine, and it actually makes sense that the scheduler won't delete the old runs. However, I don't think it is the right behavior for the scheduler to continue scheduling runs that are earlier than the actual start date of the DAG.





[GitHub] [airflow] raphaelauv commented on issue #20461: Airflow is trying to schedule tasks prior to DAG's start_date

Posted by GitBox <gi...@apache.org>.
raphaelauv commented on issue #20461:
URL: https://github.com/apache/airflow/issues/20461#issuecomment-1001568744


   The scheduler is not going to delete DAG runs that already exist, so you have to delete them manually. I think it's safe this way.





[GitHub] [airflow] potiuk commented on issue #20461: Airflow is trying to schedule tasks prior to DAG's start_date

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #20461:
URL: https://github.com/apache/airflow/issues/20461#issuecomment-1030882744


   Is this the same as #21011 @uranusjr ? WDYT?





[GitHub] [airflow] avkirilishin commented on issue #20461: Airflow is trying to schedule tasks prior to DAG's start_date

Posted by GitBox <gi...@apache.org>.
avkirilishin commented on issue #20461:
URL: https://github.com/apache/airflow/issues/20461#issuecomment-1030706979


   It's very strange because in 2.2.2 we have strong checks on `max_active_runs` and don't create more DagRuns than this value. And I cannot reproduce it in the current main.
   
   When were these runs created? Did you upgrade Airflow after that?





[GitHub] [airflow] ephraimbuddy commented on issue #20461: Airflow is trying to schedule tasks prior to DAG's start_date

Posted by GitBox <gi...@apache.org>.
ephraimbuddy commented on issue #20461:
URL: https://github.com/apache/airflow/issues/20461#issuecomment-1048910548


   > > @avkirilishin runs were created in 2.2.0 or 2.2.1 likely then Airflow was upgraded to 2.2.2.
   > 
   > @andreychernih I think there are two different problems:
   > 
   > 1. The problem is related to the different logic of the scheduler before and after the update. Maybe there are no tasks in the running dags or something else. Can you show the rows for this dag run in dag, dag_run and task_instance?
   > 2. I agree with you that it is not the right behavior for the scheduler to continue scheduling runs that are earlier than the actual start date of the DAG or Task. It can happen, for example, after turning the dag off and back on. So I made a PR to fix it: [Add dependency to the running_deps #21684](https://github.com/apache/airflow/pull/21684)
   
   Since tasks go through `queued` state before moving to `running`, I think that the scheduler has the check here: https://github.com/apache/airflow/blob/3c12c2e1e5c2d1d961addbe0452186d32800135e/airflow/ti_deps/dependencies_deps.py#L90 ?
   





[GitHub] [airflow] avkirilishin commented on issue #20461: Airflow is trying to schedule tasks prior to DAG's start_date

Posted by GitBox <gi...@apache.org>.
avkirilishin commented on issue #20461:
URL: https://github.com/apache/airflow/issues/20461#issuecomment-1049012740


   > Since tasks go through `queued` state before moving to `running`, I think that the scheduler has the check here:
   > 
   > https://github.com/apache/airflow/blob/3c12c2e1e5c2d1d961addbe0452186d32800135e/airflow/ti_deps/dependencies_deps.py#L90
   > 
   > ?
   
   I think `SCHEDULER_QUEUED_DEPS` is actually not equivalent to the logic in the scheduler:
   https://github.com/apache/airflow/blob/3c4524b4ec2b42a8af0a8c7b9d8f1d065b2bfc83/airflow/ti_deps/dependencies_deps.py#L67-L75





[GitHub] [airflow] andreychernih commented on issue #20461: Airflow is trying to schedule tasks prior to DAG's start_date

Posted by GitBox <gi...@apache.org>.
andreychernih commented on issue #20461:
URL: https://github.com/apache/airflow/issues/20461#issuecomment-1001583189


   This is fine, and it actually makes sense that the scheduler won't delete the old runs. However, I don't think it is the right behavior for the scheduler to continue scheduling runs that are earlier than the actual start date of the DAG run.





[GitHub] [airflow] avkirilishin commented on issue #20461: Airflow is trying to schedule tasks prior to DAG's start_date

Posted by GitBox <gi...@apache.org>.
avkirilishin commented on issue #20461:
URL: https://github.com/apache/airflow/issues/20461#issuecomment-1046051990


   > @avkirilishin runs were created in 2.2.0 or 2.2.1 likely then Airflow was upgraded to 2.2.2.
   
   @andreychernih I think there are two different problems:
   
   1) The problem is related to the different logic of the scheduler before and after the update. Maybe there are no tasks in the running dags or something else. Can you show the rows for this dag run in dag, dag_run and task_instance?
   
   2) I agree with you that it is not the right behavior for the scheduler to continue scheduling runs that are earlier than the actual start date of the DAG or Task. It can happen, for example, after turning the dag off and back on. So I made a PR to fix it: https://github.com/apache/airflow/pull/21684





[GitHub] [airflow] uranusjr commented on issue #20461: Airflow is trying to schedule tasks prior to DAG's start_date

Posted by GitBox <gi...@apache.org>.
uranusjr commented on issue #20461:
URL: https://github.com/apache/airflow/issues/20461#issuecomment-1046077336


   (Sorry I missed this) I think #21011 is different. That one schedules the task (incorrectly) _at_ `start_date` even if that time does not lie on the schedule, and the fix is to delay the first run to a time after `start_date` that matches the schedule. The problem description here, however, says the run is scheduled _before_ `start_date`.
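
To illustrate the #21011 case, here is a minimal sketch of aligning the first run to a fixed-interval schedule. It is not Airflow's timetable code: `first_run_on_or_after`, `anchor`, and `interval` are hypothetical names, and the sketch assumes schedule points of the form `anchor + k * interval` and treats a `start_date` that already lies on the schedule as a valid first run.

```python
from datetime import datetime, timedelta


def first_run_on_or_after(start_date, anchor, interval):
    """Return the earliest schedule point that is not before start_date.

    Schedule points are anchor + k * interval for integer k.
    """
    elapsed = start_date - anchor
    # Ceiling division: timedelta // timedelta floors to an int,
    # so negating before and after rounds up instead of down.
    steps = -(-elapsed // interval)
    return anchor + steps * interval


daily = timedelta(days=1)
midnight_anchor = datetime(2022, 1, 1)

# start_date falls between two schedule points: the first run is delayed.
off_schedule = first_run_on_or_after(datetime(2022, 1, 3, 6, 30), midnight_anchor, daily)
# start_date already lies on the schedule: it is used as is.
on_schedule = first_run_on_or_after(datetime(2022, 1, 3), midnight_anchor, daily)
```

Either way the result is never earlier than `start_date`, which is what distinguishes that fix from the behavior reported in this issue.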





[GitHub] [airflow] andreychernih commented on issue #20461: Airflow is trying to schedule tasks prior to DAG's start_date

Posted by GitBox <gi...@apache.org>.
andreychernih commented on issue #20461:
URL: https://github.com/apache/airflow/issues/20461#issuecomment-1043535978


   @avkirilishin The runs were likely created in 2.2.0 or 2.2.1, and then Airflow was upgraded to 2.2.2.





[GitHub] [airflow] ephraimbuddy commented on issue #20461: Airflow is trying to schedule tasks prior to DAG's start_date

Posted by GitBox <gi...@apache.org>.
ephraimbuddy commented on issue #20461:
URL: https://github.com/apache/airflow/issues/20461#issuecomment-1054030332


   What I think we should do (I have not tried either option):
   
   1) Make sure that we don't create dagruns prior to a DAG's start date
   or
   2) Exclude the task instances here: https://github.com/apache/airflow/blob/f0bbb9d1079e2660b4aa6e57c53faac84b23ce3d/airflow/jobs/scheduler_job.py#L280-L289
   
   I think the solution you have right now will have the scheduler put those task instances in the `queued` state and never move them to the `running` state, which might not be a very good experience for users. WDYT @avkirilishin?
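
The two options above could be sketched roughly as follows. This is plain Python rather than scheduler code: `TaskInstance`, `may_create_run`, and `schedulable` are hypothetical names for illustration, and the real change would sit in the run-creation path or in the scheduler query linked above.

```python
from datetime import datetime
from typing import NamedTuple


class TaskInstance(NamedTuple):
    # Hypothetical stand-in; not Airflow's TaskInstance model.
    task_id: str
    run_date: datetime


def may_create_run(run_date, dag_start_date):
    """Option 1: refuse to create a run dated before the DAG's start date."""
    return run_date >= dag_start_date


def schedulable(task_instances, dag_start_date):
    """Option 2: drop task instances whose run predates the start date."""
    return [ti for ti in task_instances if ti.run_date >= dag_start_date]


dag_start = datetime(2021, 11, 1)
candidates = [
    TaskInstance("extract", datetime(2021, 9, 1)),
    TaskInstance("extract", datetime(2021, 11, 2)),
]
kept = schedulable(candidates, dag_start)
```

Option 1 only prevents new stale runs, while option 2 also covers runs that already exist, such as the ones left over after a `start_date` change.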

