Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/01/25 10:00:18 UTC

[GitHub] [airflow] shohamy7 opened a new issue #21089: Scheduler Loading DAGs That Have Not Changed

shohamy7 opened a new issue #21089:
URL: https://github.com/apache/airflow/issues/21089


   ### Apache Airflow version
   
   2.2.3 (latest released)
   
   ### What happened
   
   I have a few DAGs in my DAG folder. I used git-sync to copy them into the DAG folder.
   I checked the DAG folder with `ls -l /opt/airflow/dags/repo/` and saw that the files were last modified on Jan 24.
   Example for one DAG that I have in my dag folder:
   `-rw-r--r-- 1 65533 root 4141 **Jan 24 19:30** clear_missing_dags.py`
   When I opened the scheduler logs at `/opt/airflow/logs/scheduler/latest/{my_dag_file}.log` and the DAG processor manager log at `/opt/airflow/logs/dag_processor_manager/dag_processor_manager.log`, I saw that the scheduler loaded the DAGs on Jan 25 even though they had not changed.
   Example from the scheduler logs:
   [2022-01-25 09:46:18,615] {processor.py:654} INFO - DAG(s) dict_keys(['clear_missing_dags']) retrieved from /opt/airflow/dags/repo/clear_missing_dags.py
   [2022-01-25 09:46:18,633] {logging_mixin.py:109} INFO - [2022-01-25 09:46:18,633] {dag.py:2396} INFO - Sync 1 DAGs
   [2022-01-25 09:46:18,655] {logging_mixin.py:109} INFO - [2022-01-25 09:46:18,655] {dag.py:2935} INFO - Setting next_dagrun for clear_missing_dags to None
   [**2022-01-25 09:46:18,676**] {processor.py:171} INFO - Processing /opt/airflow/dags/repo/clear_missing_dags.py took 0.186 seconds
   Example from the DAG processor manager log:
   DAG File Processing Stats
   
   File Path                                     PID    Runtime      # DAGs    # Errors  Last Runtime    Last Run
   --------------------------------------------  -----  ---------  --------  ----------  --------------  -------------------
   /opt/airflow/dags/repo/bash_example.py                                 0           1  0.15s           2022-01-25T09:48:20
   /opt/airflow/dags/repo/branch_datetime.py                              0           1  0.15s           2022-01-25T09:48:26
   /opt/airflow/dags/repo/python_example.py                               1           0  0.20s           2022-01-25T09:48:33
   /opt/airflow/dags/repo/clear_missing_dags.py                           1           0  0.17s           2022-01-25T09:48:20
   ================================================================================
   [**2022-01-25 09:48:48,730**] {manager.py:1065} INFO - Finding 'running' jobs without a recent heartbeat
   [2022-01-25 09:48:48,731] {manager.py:1069} INFO - Failing jobs without heartbeat after 2022-01-25 09:43:48.731074+00:00
   As far as I know, the scheduler checks whether a DAG file has changed by comparing the file's modification time with the last time the DAG was loaded.
   It seems like this is not working.
   
   ### What you expected to happen
   
   I expected that the scheduler would not load a DAG again until its file changes.
   
   ### How to reproduce
   
   This happens on the default helm chart deployment (I used `helm install airflow .`).
   You can reproduce it by deploying the chart and creating a dag file inside the dag folder.
   
   ### Operating System
   
   Debian GNU/Linux 10 (buster)
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   Used the default values from the helm chart and only configured the git-sync option
   
   ### Anything else
   
   This problem happens each time the scheduler loads DAGs. It causes the scheduler to run the cluster policies every X seconds instead of only when a DAG has changed.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] boring-cyborg[bot] commented on issue #21089: Scheduler Loading DAGs That Have Not Changed

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #21089:
URL: https://github.com/apache/airflow/issues/21089#issuecomment-1021008764


   Thanks for opening your first issue here! Be sure to follow the issue template!
   





[GitHub] [airflow] potiuk closed issue #21089: Scheduler Loading DAGs That Have Not Changed

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #21089:
URL: https://github.com/apache/airflow/issues/21089


   





[GitHub] [airflow] shohamy7 commented on issue #21089: Scheduler Loading DAGs That Have Not Changed

Posted by GitBox <gi...@apache.org>.
shohamy7 commented on issue #21089:
URL: https://github.com/apache/airflow/issues/21089#issuecomment-1026106524


   Thanks for the response!
   I didn't know about this commit. I think you are right about changing the description of `min_file_process_interval`.
   I'll be glad to open a PR for these changes and contribute to the project.
   
   In addition, I have another question about your answer:
   How long does it take the DAG to update after the dag file has been modified?





[GitHub] [airflow] potiuk commented on issue #21089: Scheduler Loading DAGs That Have Not Changed

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #21089:
URL: https://github.com/apache/airflow/issues/21089#issuecomment-1033861621


   This is as intended. All DAGs are parsed continuously, no matter whether they changed or not, simply because re-parsing a DAG file at different times can generate a different DAG. For example, if the DAG file reads an external file and creates the DAG structure based on it, re-parsing might produce a different DAG when the external file changes. The same applies to importing external libraries.
   
   The time of last modification of the DAG file only matters for scheduling priority; each DAG file will be re-parsed every `min_file_process_interval` seconds. This is how Airflow works currently.
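
To illustrate why an unchanged DAG file can still yield a different DAG, here is a minimal sketch that deliberately avoids importing Airflow. The helper `build_task_ids`, the file name `tables.json`, and the `load_` prefix are all hypothetical; they stand in for a DAG file that derives its tasks from an external file.

```python
import json
import os
import tempfile


def build_task_ids(config_path):
    """Return the task ids a DAG file would create from an external config file."""
    with open(config_path) as f:
        tables = json.load(f)["tables"]
    return [f"load_{name}" for name in tables]


def demo():
    """Parse twice; only the external file changes between the two parses."""
    with tempfile.TemporaryDirectory() as tmp:
        config_path = os.path.join(tmp, "tables.json")
        with open(config_path, "w") as f:
            json.dump({"tables": ["users"]}, f)
        first = build_task_ids(config_path)

        # The "DAG file" (this code) is untouched; only its input changes.
        with open(config_path, "w") as f:
            json.dump({"tables": ["users", "orders"]}, f)
        second = build_task_ids(config_path)
    return first, second
```

The second parse returns an extra task id even though the parsing code has the same modification time, which is the situation a check based only on the DAG file's mtime would miss.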





[GitHub] [airflow] julius-ziegler commented on issue #21089: Scheduler Loading DAGs That Have Not Changed

Posted by GitBox <gi...@apache.org>.
julius-ziegler commented on issue #21089:
URL: https://github.com/apache/airflow/issues/21089#issuecomment-1033639419


   We also had problems with this and looked at the code a while ago and came to the conclusion that it probably does not work the way it is intended.
   
   It attempts to decide if a file needs to be re-parsed here:
   https://github.com/apache/airflow/blob/39e395f9816c04ef2f033eb0b4f635fc3018d803/airflow/dag_processing/manager.py#L973-L983
   
   However, `self.get_last_finish_time()` relies on `self._file_stats`, which is initialized empty in the constructor of `DagFileProcessorManager`:
   
   https://github.com/apache/airflow/blob/39e395f9816c04ef2f033eb0b4f635fc3018d803/airflow/dag_processing/manager.py#L445-L446 
   
   And that I think gets created new on each parsing interval (it would be good if someone else could verify this).
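
The kind of check discussed here can be sketched roughly as follows. This is a simplified illustration, not a copy of the linked code: the function name `recently_processed` and its parameters are hypothetical, and the modified-time exception is an assumption based on the `file_parsing_sort_mode` discussion elsewhere in this thread.

```python
from datetime import datetime, timedelta


def recently_processed(now, last_finish_time, file_modified_time,
                       min_file_process_interval, is_mtime_mode):
    """Return True if the file can be skipped in this parsing loop."""
    # No recorded finish time (e.g. empty stats): the file is always parsed.
    if last_finish_time is None:
        return False
    # The interval has elapsed: the file is due for re-parsing.
    if (now - last_finish_time).total_seconds() >= min_file_process_interval:
        return False
    # Assumed exception: in modified-time mode a newer mtime forces a re-parse.
    if (is_mtime_mode and file_modified_time is not None
            and file_modified_time > last_finish_time):
        return False
    return True
```

Under this sketch, a file with no recorded finish time is never skipped, so if the stats really were reset on every loop, every file would be re-parsed every time regardless of the configured interval.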





[GitHub] [airflow] avkirilishin commented on issue #21089: Scheduler Loading DAGs That Have Not Changed

Posted by GitBox <gi...@apache.org>.
avkirilishin commented on issue #21089:
URL: https://github.com/apache/airflow/issues/21089#issuecomment-1030641791


   > How long does it take the DAG to update after the dag file has been modified?
   
   I think it depends on `parsing_processes` (https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#parsing-processes) and `dag_dir_list_interval` for new files (https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#dag-dir-list-interval).
   
   DagFileProcessorManager adds files to the queue and checks modified files only after all files have been processed:
   https://github.com/apache/airflow/blob/39e395f9816c04ef2f033eb0b4f635fc3018d803/airflow/dag_processing/manager.py#L573-L577





[GitHub] [airflow] julius-ziegler edited a comment on issue #21089: Scheduler Loading DAGs That Have Not Changed

Posted by GitBox <gi...@apache.org>.
julius-ziegler edited a comment on issue #21089:
URL: https://github.com/apache/airflow/issues/21089#issuecomment-1033639419


   We also had problems with this and looked at the code a while ago and came to the conclusion that it probably does not work the way it is intended.
   
   It attempts to decide if a file needs to be re-parsed here:
   https://github.com/apache/airflow/blob/39e395f9816c04ef2f033eb0b4f635fc3018d803/airflow/dag_processing/manager.py#L973-L983
   
   However, `self.get_last_finish_time()` relies on `self._file_stats`, which is initialized empty in the constructor of `DagFileProcessorAgent`:
   
   https://github.com/apache/airflow/blob/39e395f9816c04ef2f033eb0b4f635fc3018d803/airflow/dag_processing/manager.py#L445-L446 
   
   And that I think gets created new on each parsing interval (it would be good if someone else could verify this).





[GitHub] [airflow] avkirilishin commented on issue #21089: Scheduler Loading DAGs That Have Not Changed

Posted by GitBox <gi...@apache.org>.
avkirilishin commented on issue #21089:
URL: https://github.com/apache/airflow/issues/21089#issuecomment-1025009337


   You can set `min_file_process_interval` to a large value (for example 86400), `file_parsing_sort_mode` to "modified_time" and `default_timezone` to "system".
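
One way to apply these three settings is through Airflow's `AIRFLOW__{SECTION}__{KEY}` environment variables. The sketch below only builds the variable names and values. The section placement (`scheduler` for the first two options, `core` for `default_timezone`) is an assumption to verify against the configuration reference for your Airflow version, and the helper names are hypothetical.

```python
def airflow_env_var(section, key):
    """Build the environment variable name for an Airflow config option."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


# Values suggested in the comment above; sections are assumed, see the lead-in.
SUGGESTED = {
    ("scheduler", "min_file_process_interval"): "86400",
    ("scheduler", "file_parsing_sort_mode"): "modified_time",
    ("core", "default_timezone"): "system",
}


def as_env(settings):
    """Map (section, key) -> value pairs to environment variable assignments."""
    return {airflow_env_var(s, k): v for (s, k), v in settings.items()}


if __name__ == "__main__":
    for name, value in as_env(SUGGESTED).items():
        print(f"{name}={value}")
```

With the Helm chart, the same names and values could be passed as environment variables to the scheduler pods instead of editing `airflow.cfg`.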
   
   I think we have to change the `min_file_process_interval` description due to https://github.com/apache/airflow/commit/add7490145fabd097d605d85a662dccd02b600de
   
   > Number of seconds after which a DAG file is parsed. The DAG file is parsed every min_file_process_interval number of seconds. Updates to DAGs are reflected after this interval **_or after the DAG file modification if `file_parsing_sort_mode` is set to "modified_time"_**. Keeping this number low will increase CPU usage.
   
   

