You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/02/08 07:19:45 UTC

[GitHub] [airflow] ephraimbuddy opened a new pull request #21413: Fix max_active_runs=1 not scheduling runs when min_file_process_interval is high

ephraimbuddy opened a new pull request #21413:
URL: https://github.com/apache/airflow/pull/21413


   When we make a query to get the total active runs, it returns the dagrun being
   examined as running. Therefore we subtract the dagrun being examined from
   active runs.
   
   There's a bug in this query `dag.get_num_active_runs` but for now we should get this fixed this way
   
   closes: https://github.com/apache/airflow/issues/21083
   closes: https://github.com/apache/airflow/issues/19901
   
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#pull-request-guidelines)** for more information.
   In case of fundamental code change, Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in [UPDATING.md](https://github.com/apache/airflow/blob/main/UPDATING.md).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ephraimbuddy commented on a change in pull request #21413: Fix max_active_runs=1 not scheduling runs when min_file_process_interval is high

Posted by GitBox <gi...@apache.org>.
ephraimbuddy commented on a change in pull request #21413:
URL: https://github.com/apache/airflow/pull/21413#discussion_r803352833



##########
File path: airflow/jobs/scheduler_job.py
##########
@@ -1083,6 +1083,7 @@ def _schedule_dag_run(
         # TODO[HA]: Rename update_state -> schedule_dag_run, ?? something else?
         schedulable_tis, callback_to_run = dag_run.update_state(session=session, execute_callbacks=False)
         if dag_run.state in State.finished:
+            session.flush()  # to update the dag_run state

Review comment:
       I think there's no harm in having a flush here. The dag_run is already in a finished state but a new query at that point doesn't see it as such.
   At https://github.com/apache/airflow/pull/21413/files#diff-bde85feb359b12bdd358aed4106ef4fccbd8fa9915e16b9abb7502912a1c1ab3R1058,  a flush was used when the dag_run was marked failed. I think the same can be done here, plus not running `dag_run.schedule_tis` when the dag_run is in a finished state




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ephraimbuddy commented on a change in pull request #21413: Fix max_active_runs=1 not scheduling runs when min_file_process_interval is high

Posted by GitBox <gi...@apache.org>.
ephraimbuddy commented on a change in pull request #21413:
URL: https://github.com/apache/airflow/pull/21413#discussion_r806020930



##########
File path: airflow/jobs/scheduler_job.py
##########
@@ -1083,6 +1083,7 @@ def _schedule_dag_run(
         # TODO[HA]: Rename update_state -> schedule_dag_run, ?? something else?
         schedulable_tis, callback_to_run = dag_run.update_state(session=session, execute_callbacks=False)
         if dag_run.state in State.finished:
+            session.flush()  # to update the dag_run state

Review comment:
       Moved this to `dag_run.update_state` method




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ephraimbuddy merged pull request #21413: Fix max_active_runs=1 not scheduling runs when min_file_process_interval is high

Posted by GitBox <gi...@apache.org>.
ephraimbuddy merged pull request #21413:
URL: https://github.com/apache/airflow/pull/21413


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #21413: Fix max_active_runs=1 not scheduling runs when min_file_process_interval is high

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #21413:
URL: https://github.com/apache/airflow/pull/21413#issuecomment-1049203508


   The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ephraimbuddy commented on a change in pull request #21413: Fix max_active_runs=1 not scheduling runs when min_file_process_interval is high

Posted by GitBox <gi...@apache.org>.
ephraimbuddy commented on a change in pull request #21413:
URL: https://github.com/apache/airflow/pull/21413#discussion_r806020930



##########
File path: airflow/jobs/scheduler_job.py
##########
@@ -1083,6 +1083,7 @@ def _schedule_dag_run(
         # TODO[HA]: Rename update_state -> schedule_dag_run, ?? something else?
         schedulable_tis, callback_to_run = dag_run.update_state(session=session, execute_callbacks=False)
         if dag_run.state in State.finished:
+            session.flush()  # to update the dag_run state

Review comment:
       Moved this to `dag_run.update_state` method




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ephraimbuddy commented on a change in pull request #21413: Fix max_active_runs=1 not scheduling runs when min_file_process_interval is high

Posted by GitBox <gi...@apache.org>.
ephraimbuddy commented on a change in pull request #21413:
URL: https://github.com/apache/airflow/pull/21413#discussion_r808711668



##########
File path: airflow/models/dagrun.py
##########
@@ -609,6 +609,7 @@ def update_state(
                 self.data_interval_end,
                 self.dag_hash,
             )
+            session.flush()

Review comment:
       That was my initial thought but I don't know if there would be a regression when added there since it would apply to running `dagruns` as well. It use to have a `commit` down there, after `merge` but was removed here: https://github.com/apache/airflow/commit/73b9163a8f55ce3d5bf6aec0a558952c27dd1b55#diff-649fbbf224bab54417f03338c27d0fdb3c3336e53a522a13dfd9806c99f63137L377
   
   cc: @ashb 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ephraimbuddy commented on pull request #21413: Fix max_active_runs=1 not scheduling runs when min_file_process_interval is high

Posted by GitBox <gi...@apache.org>.
ephraimbuddy commented on pull request #21413:
URL: https://github.com/apache/airflow/pull/21413#issuecomment-1033098785


   It turned out that we just needed to flush the session at this point and not that dag.get_num_active_runs is giving a wrong value.
   
   I'm wondering if it's still necessary to run dag_run.schedule_tis for a finished dag_run?
   I'm proposing to change the current code in that area to this:
   ```python
   if dag_run.state in State.finished:
         session.flush()  # to update the dag_run state
         active_runs = dag.get_num_active_runs(only_running=False, session=session)
         # Work out if we should allow creating a new DagRun now?
         if self._should_update_dag_next_dagruns(dag, dag_model, active_runs):
             dag_model.calculate_dagrun_date_fields(dag, dag.get_run_data_interval(dag_run))
   else:
       # This will do one query per dag run. We "could" build up a complex
       # query to update all the TIs across all the execution dates and dag
       # IDs in a single query, but it turns out that can be _very very slow_
       # see #11147/commit ee90807ac for more details
       dag_run.schedule_tis(schedulable_tis, session)
   return callback
   ```
   cc: @ashb @potiuk @uranusjr @jedcunningham 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jedcunningham commented on a change in pull request #21413: Fix max_active_runs=1 not scheduling runs when min_file_process_interval is high

Posted by GitBox <gi...@apache.org>.
jedcunningham commented on a change in pull request #21413:
URL: https://github.com/apache/airflow/pull/21413#discussion_r803220482



##########
File path: airflow/jobs/scheduler_job.py
##########
@@ -1083,6 +1083,7 @@ def _schedule_dag_run(
         # TODO[HA]: Rename update_state -> schedule_dag_run, ?? something else?
         schedulable_tis, callback_to_run = dag_run.update_state(session=session, execute_callbacks=False)
         if dag_run.state in State.finished:
+            session.flush()  # to update the dag_run state

Review comment:
       It sure feels like if we miss a flush and get into a bad state, we need more to really solve this. We don't want a "one shot to calculate the next dagrun" scenario, right? Or am I missing something?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jedcunningham commented on a change in pull request #21413: Fix max_active_runs=1 not scheduling runs when min_file_process_interval is high

Posted by GitBox <gi...@apache.org>.
jedcunningham commented on a change in pull request #21413:
URL: https://github.com/apache/airflow/pull/21413#discussion_r808339492



##########
File path: airflow/models/dagrun.py
##########
@@ -609,6 +609,7 @@ def update_state(
                 self.data_interval_end,
                 self.dag_hash,
             )
+            session.flush()

Review comment:
       Overall lgtm, should this be down with the `merge` though so it happens regardless of the ultimate state?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ephraimbuddy commented on a change in pull request #21413: Fix max_active_runs=1 not scheduling runs when min_file_process_interval is high

Posted by GitBox <gi...@apache.org>.
ephraimbuddy commented on a change in pull request #21413:
URL: https://github.com/apache/airflow/pull/21413#discussion_r811942103



##########
File path: airflow/models/dagrun.py
##########
@@ -609,6 +609,7 @@ def update_state(
                 self.data_interval_end,
                 self.dag_hash,
             )
+            session.flush()

Review comment:
       I just did move this down with `merge` and query count increased...If we are cool with it I will fix the tests




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org