Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/09/01 13:06:18 UTC

[GitHub] [airflow] potiuk opened a new pull request #17963: Improve the description of how to handle dynamic task generation

potiuk opened a new pull request #17963:
URL: https://github.com/apache/airflow/pull/17963


   The Top-Level best practices were a little misleading. They
   suggested that no code should be written at the top-level DAG other
   than just creating operators, but the story is a little more nuanced.
   
   A better explanation is given, along with examples of how to deal
   with the situation when you need to generate your DAG based on
   some meta-data. From a Slack discussion it seems that it is not
   obvious at all what the best ways to handle that are, so two
   alternatives are presented: generating a meta-data file, and
   generating an importable Python file containing the meta-data.
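   
   For illustration, the first alternative (a generated meta-data file read at parse time) could look
   roughly like the sketch below. The file name `dag_metadata.json`, its `tasks` key and the DAG id are
   hypothetical, not taken from the docs change itself:
   
   ```python
   # Sketch of the "meta-data file" alternative. dag_metadata.json is assumed to be
   # generated externally and published in the same folder as this DAG file.
   import json
   import os

   from airflow import DAG
   from airflow.operators.bash import BashOperator
   from airflow.utils.dates import days_ago

   # Resolve the meta-data file relative to the DAG file itself.
   metadata_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "dag_metadata.json")
   with open(metadata_path) as f:
       task_names = json.load(f)["tasks"]

   with DAG(dag_id="metadata_driven_dag", schedule_interval=None, start_date=days_ago(2)) as dag:
       for name in task_names:
           BashOperator(task_id=name, bash_command=f"echo {name}")
   ```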
   
   <!--
   Thank you for contributing! Please make sure that your code changes
   are covered with tests. And in case of new features or big changes
   remember to adjust the documentation.
   
   Feel free to ping committers for the review!
   
   In case of existing issue, reference it using one of the following:
   
   closes: #ISSUE
   related: #ISSUE
   
   How to write a good git commit message:
   http://chris.beams.io/posts/git-commit/
   -->
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#pull-request-guidelines)** for more information.
   In case of fundamental code change, Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in [UPDATING.md](https://github.com/apache/airflow/blob/main/UPDATING.md).
   





[GitHub] [airflow] potiuk commented on pull request #17963: Improve the description of how to handle dynamic task generation

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17963:
URL: https://github.com/apache/airflow/pull/17963#issuecomment-910552372


   And yeah. Dynamic task generation based on the previous task's output is on our list as well (and likely for 2.3 too).
   





[GitHub] [airflow] SamWheating commented on pull request #17963: Improve the description of how to handle dynamic task generation

Posted by GitBox <gi...@apache.org>.
SamWheating commented on pull request #17963:
URL: https://github.com/apache/airflow/pull/17963#issuecomment-910417229


   I've encountered some race-conditions with this kind of dynamic task generation before, typically when trying to update the metadata file and trigger the DAG immediately after. 
   
   Two cases where this shows up:
   
    - DAG 1 writes a metadata file and triggers DAG 2 (which uses the metadata file to generate tasks)
    - Uploading a metadata file and immediately triggering a DAG via REST API
    
   In these cases, the DAG often isn't re-processed between the time that the metadata file is updated and the time that the DAG is triggered, so the tasks which are created are based on the previous version of the metadata file.
   
   I think that this can be a bit of a trap and can definitely be misleading for users who aren't familiar with Airflow's internals. Do you think we can include some warnings about this behaviour in the documentation?
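   
   For concreteness, a rough (hypothetical) sketch of the first case - the DAG ids, paths and metadata
   format here are made up for illustration:
   
   ```python
   # DAG 1 rewrites the metadata file that DAG 2 uses to build its tasks, then
   # triggers DAG 2 right away. Because the scheduler may not have re-parsed
   # DAG 2 in between, the triggered run can still be built from the previous
   # version of the metadata.
   import json

   from airflow import DAG
   from airflow.operators.python import PythonOperator
   from airflow.operators.trigger_dagrun import TriggerDagRunOperator
   from airflow.utils.dates import days_ago


   def write_metadata():
       # Assumed location of the shared metadata file inside the DAG folder.
       with open("/opt/airflow/dags/dag_2_metadata.json", "w") as f:
           json.dump({"tasks": ["extract", "transform", "load"]}, f)


   with DAG(dag_id="dag_1", schedule_interval=None, start_date=days_ago(2)) as dag:
       update_metadata = PythonOperator(task_id="update_metadata", python_callable=write_metadata)
       trigger_dag_2 = TriggerDagRunOperator(task_id="trigger_dag_2", trigger_dag_id="dag_2")
       update_metadata >> trigger_dag_2
   ```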





[GitHub] [airflow] github-actions[bot] commented on pull request #17963: Improve the description of how to handle dynamic task generation

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #17963:
URL: https://github.com/apache/airflow/pull/17963#issuecomment-913460047


   The PR most likely needs to run the full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.





[GitHub] [airflow] potiuk commented on pull request #17963: Improve the description of how to handle dynamic task generation

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17963:
URL: https://github.com/apache/airflow/pull/17963#issuecomment-913224012


   Hey @BasPH - very good comments. I rewrote the whole chapter now, incorporating your comments, added more examples, and put the sections in a more logical sequence. I also added more references and explanations.
   
   Please take a look. I'd love to get it into 2.1.4 as it would make it a lot easier to answer a number of users' questions.





[GitHub] [airflow] potiuk commented on pull request #17963: Improve the description of how to handle dynamic task generation

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17963:
URL: https://github.com/apache/airflow/pull/17963#issuecomment-910501495


   Take a look please @SamWheating 





[GitHub] [airflow] BasPH commented on pull request #17963: Improve the description of how to handle dynamic task generation

Posted by GitBox <gi...@apache.org>.
BasPH commented on pull request #17963:
URL: https://github.com/apache/airflow/pull/17963#issuecomment-913459726


   LGTM!





[GitHub] [airflow] potiuk commented on a change in pull request #17963: Improve the description of how to handle dynamic task generation

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #17963:
URL: https://github.com/apache/airflow/pull/17963#discussion_r702434901



##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,63 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
 
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
 :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.

Review comment:
       Oh yeah. The minimum/maximum notion of time/frequency is reversed. I will update it to still keep "minimum" but refer to time rather than frequency to avoid confusion :D







[GitHub] [airflow] BasPH commented on a change in pull request #17963: Improve the description of how to handle dynamic task generation

Posted by GitBox <gi...@apache.org>.
BasPH commented on a change in pull request #17963:
URL: https://github.com/apache/airflow/pull/17963#discussion_r700266850



##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,47 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
 
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of

Review comment:
       ```suggestion
   runs every time Airflow parses an eligible Python file, which happens at the minimum frequency of
   ```

##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,63 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
 
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
 :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.
 
+If you need to use some meta-data to prepare your DAG structure, you should prepare the meta-data externally.
+For example you can export from your database and publish together with the DAGs in a convenient file format
+(JSON, YAML formats are good candidates). Ideally it should be published in the same folder as DAG.
+If you do it, then you can read it from within the DAG by using constructs similar to
+``os.path.dirname(os.path.abspath(__file__))`` to get the directory where you can load your files
+from in such case.
+
+You can also generate directly Python code containing the meta-data. Then instead of reading content
+of such generated file from JSON or YAML, you can simply import objects generated in such Python files. That
+makes it easier to use such code from multiple DAGs without adding the extra code to find, load and parse

Review comment:
       Would mention this approach first if it's the easier/preferred one.

##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,47 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
 
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
 :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.
 
+If you need to use some meta-data to prepare your DAG structure, you should prepare the meta-data externally.
+For example you can export from your database and publish together with the DAGs in a convenient file format
+(JSON, YAML formats are good candidates). Ideally it should be published in the same folder as DAG.

Review comment:
       Explain why the metadata should be published in the same folder as DAG

##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,63 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
 
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
 :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.
 
+If you need to use some meta-data to prepare your DAG structure, you should prepare the meta-data externally.
+For example you can export from your database and publish together with the DAGs in a convenient file format
+(JSON, YAML formats are good candidates). Ideally it should be published in the same folder as DAG.
+If you do it, then you can read it from within the DAG by using constructs similar to
+``os.path.dirname(os.path.abspath(__file__))`` to get the directory where you can load your files
+from in such case.
+
+You can also generate directly Python code containing the meta-data. Then instead of reading content

Review comment:
       Could you add a list at the start to give an overview of the options before diving in? For example:
   
   ... There are generally two approaches for creating dynamic DAGs ....:
   1. Generate a meta-file (e.g. YAML/JSON) which is interpreted by a Python script which generates tasks/DAGs
   2. Generate a Python object (e.g. dict) from which tasks & DAGs are generated

##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,47 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
 
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators

Review comment:
       Explain why you should avoid top level code

##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,47 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
 
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
 :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.
 
+If you need to use some meta-data to prepare your DAG structure, you should prepare the meta-data externally.
+For example you can export from your database and publish together with the DAGs in a convenient file format
+(JSON, YAML formats are good candidates). Ideally it should be published in the same folder as DAG.
+If you do it, then you can read it from within the DAG by using constructs similar to

Review comment:
       Again, clarify "it"

##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,63 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
 
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
 :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.
 
+If you need to use some meta-data to prepare your DAG structure, you should prepare the meta-data externally.
+For example you can export from your database and publish together with the DAGs in a convenient file format
+(JSON, YAML formats are good candidates). Ideally it should be published in the same folder as DAG.
+If you do it, then you can read it from within the DAG by using constructs similar to
+``os.path.dirname(os.path.abspath(__file__))`` to get the directory where you can load your files
+from in such case.
+
+You can also generate directly Python code containing the meta-data. Then instead of reading content
+of such generated file from JSON or YAML, you can simply import objects generated in such Python files. That
+makes it easier to use such code from multiple DAGs without adding the extra code to find, load and parse
+the meta-data. This sounds strange, but it is surprisingly easy to generate such easy-to-parse and
+valid Python code that you can import from your DAGs.
+
+For example assume you dynamically generate (in your DAG folder), the ``my_company_utils/common.py`` file:
+
+.. code-block:: python
+
+    # This file is generated automatically !
+    ALL_TASKS = ["task1", "task2", "task3"]
+
+Then you should be able to import and use the ``ALL_TASK`` constant in all your DAGs like that:

Review comment:
       ```suggestion
   Then you can import and use the ``ALL_TASKS`` constant in all your DAGs like that:
   ```

##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,47 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
 
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
 :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.
 
+If you need to use some meta-data to prepare your DAG structure, you should prepare the meta-data externally.

Review comment:
       Explain why should you prepare the metadata externally?

##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,63 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
 
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
 :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.
 
+If you need to use some meta-data to prepare your DAG structure, you should prepare the meta-data externally.
+For example you can export from your database and publish together with the DAGs in a convenient file format
+(JSON, YAML formats are good candidates). Ideally it should be published in the same folder as DAG.
+If you do it, then you can read it from within the DAG by using constructs similar to
+``os.path.dirname(os.path.abspath(__file__))`` to get the directory where you can load your files
+from in such case.
+
+You can also generate directly Python code containing the meta-data. Then instead of reading content
+of such generated file from JSON or YAML, you can simply import objects generated in such Python files. That
+makes it easier to use such code from multiple DAGs without adding the extra code to find, load and parse
+the meta-data. This sounds strange, but it is surprisingly easy to generate such easy-to-parse and
+valid Python code that you can import from your DAGs.
+
+For example assume you dynamically generate (in your DAG folder), the ``my_company_utils/common.py`` file:
+
+.. code-block:: python
+
+    # This file is generated automatically !
+    ALL_TASKS = ["task1", "task2", "task3"]
+
+Then you should be able to import and use the ``ALL_TASK`` constant in all your DAGs like that:
+
+
+.. code-block:: python
+
+    from my_company_utils.common import ALL_TASKS
+
+    with DAG(dag_id="my_dag", schedule_interval=None, start_date=days_ago(2)) as dag:
+        for task in ALL_TASKS:
+            # create your operators and relations here
+            pass
+
+Don't forget that in this case you need to add empty ``__init__.py`` file in the ``my_company_utils`` folder
+and you should add the ``my_company_utils/.*`` line to ``.airflowignore`` file, so that the whole folder is
+ignored by the scheduler when it looks for DAGs.
+
+
+Triggering DAGs after changes
+-----------------------------
+
+Avoid triggering DAGs immediately after changing them or any other accompanying files that you change in the
+DAG folder.
+
+You should give the system sufficient time to process the changed files. This takes several steps.
+First the files have to be distributed to scheduler - usually via distributed filesystem or Git-Sync, then
+scheduler has to parse the python files and store them in the database. Depending on your configuration,
+speed of your distributed filesystem, number of files, number of DAGs, number of changes in the files,
+sizes of the files, number of schedulers, speed of CPUS, this can take from seconds to minutes, in extreme
+cases many minutes. You need to observe yours system to figure out the delays you can experience, you can
+also fine-tune that by adjusting Airflow configuration or increasing resources dedicated to Airflow
+components (CPUs, memory, I/O throughput etc.).

Review comment:
       This sounds very complex and not very useful to a user IMO. How about suggesting to wait for a change to appear in the UI? Also, could you mention which configurations can be tuned in this situation?

##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,47 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
 
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
 :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.
 
+If you need to use some meta-data to prepare your DAG structure, you should prepare the meta-data externally.
+For example you can export from your database and publish together with the DAGs in a convenient file format
+(JSON, YAML formats are good candidates). Ideally it should be published in the same folder as DAG.

Review comment:
       Would replace "it" with "the metadata" here; the metadata is mentioned 2 sentences up, so it takes some mental thought to understand the sentence.







[GitHub] [airflow] SamWheating commented on pull request #17963: Improve the description of how to handle dynamic task generation

Posted by GitBox <gi...@apache.org>.
SamWheating commented on pull request #17963:
URL: https://github.com/apache/airflow/pull/17963#issuecomment-910531103


   Looks great!
   
   In the future, how would you feel about adding support for waiting for DAGs to update? This could take shape as an operator which either:
   
    - Forces the re-parsing of a file by spinning up a DagFileProcessor on the worker (this is kinda hacky and probably won't work with all executors)
    - Polls the `last_parsed_time` of a DAG and waits until it updates before continuing (this also might not work if there's filesystem latency).
    
   The idea is that we could build one DAG from the result of a task, wait for it to be re-parsed, and then trigger it. This would provide a more stable solution for people trying to provision tasks from the result of a previous DAG or task, which seems to be a common issue:
   
   https://stackoverflow.com/questions/48589633/airflow-dynamic-tasks-at-runtime
   https://stackoverflow.com/questions/52558018/airflow-generate-dynamic-tasks-in-single-dag-task-n1-is-dependent-on-taskn
   
   This is probably a larger conversation, but I'd be happy to help with improved support for runtime-generated DAGs. Let me know what you think 😄 
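   
   A very rough sketch of the polling idea, just to make it concrete - the sensor class is made up and
   not part of Airflow, and it assumes `DagModel.last_parsed_time` is populated by the scheduler:
   
   ```python
   from datetime import datetime

   from airflow.models import DagModel
   from airflow.sensors.base import BaseSensorOperator


   class WaitForDagReparseSensor(BaseSensorOperator):
       """Waits until the target DAG has been re-parsed after a given point in time."""

       def __init__(self, *, target_dag_id: str, parsed_after: datetime, **kwargs):
           super().__init__(**kwargs)
           self.target_dag_id = target_dag_id
           self.parsed_after = parsed_after

       def poke(self, context):
           dag_model = DagModel.get_current(self.target_dag_id)
           if dag_model is None or dag_model.last_parsed_time is None:
               return False
           # Succeed only once the scheduler has parsed the DAG file again.
           return dag_model.last_parsed_time > self.parsed_after
   ```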
   
   
   





[GitHub] [airflow] potiuk commented on pull request #17963: Improve the description of how to handle dynamic task generation

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17963:
URL: https://github.com/apache/airflow/pull/17963#issuecomment-910627767


   Docs ready to review. I've seen many questions about it and I think it would be great to clarify it and give people an example.





[GitHub] [airflow] potiuk commented on pull request #17963: Improve the description of how to handle dynamic task generation

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17963:
URL: https://github.com/apache/airflow/pull/17963#issuecomment-910551195


   Yeah. Absolutely. I think this is part of the larger conversation and probably something that should be implemented alongside AIP-5 - Remote DAG Fetcher: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-5+Remote+DAG+Fetcher - it's been around for quite a while but did not have the highest priority (though I think it might be a good candidate for 2.3, alongside DAG Versioning: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-36+DAG+Versioning).





[GitHub] [airflow] potiuk merged pull request #17963: Improve the description of how to handle dynamic task generation

Posted by GitBox <gi...@apache.org>.
potiuk merged pull request #17963:
URL: https://github.com/apache/airflow/pull/17963


   





[GitHub] [airflow] SamWheating edited a comment on pull request #17963: Improve the description of how to handle dynamic task generation

Posted by GitBox <gi...@apache.org>.
SamWheating edited a comment on pull request #17963:
URL: https://github.com/apache/airflow/pull/17963#issuecomment-910417229


   I've encountered some race-conditions with this kind of dynamic task generation before, typically when trying to update the metadata file and trigger the DAG immediately after. 
   
   Two cases where this shows up:
   
    - DAG 1 writes a metadata file and triggers DAG 2 (which uses the metadata file to generate tasks)
    - Uploading a metadata file and immediately triggering a DAG via REST API
    
   In these cases, the DAG often isn't re-processed between the time that the metadata file is updated and the time that the DAG is triggered, so the tasks which are created are based on the previous version of the metadata file. I think that this can be a bit of a trap and can definitely be misleading for users who aren't familiar with Airflow's internals. 
   
   Additionally, dynamic task generation will cause previous task instances to disappear from the UI if the latest version of the DAG doesn't have the exact same set of task names.
   
   So while dynamic DAG generation from a changing metadata source is possible, it can lead to some pretty unexpected behaviour. Do you think we can include some warnings about this behaviour in the documentation?





[GitHub] [airflow] potiuk commented on pull request #17963: Improve the description of how to handle dynamic task generation

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17963:
URL: https://github.com/apache/airflow/pull/17963#issuecomment-910483040


   Yeah. Why not @SamWheating  - certainly worth it.
   
   However, this is not unique to changing dynamic parameters like that. The same story applies when you just create or update an existing DAG file and trigger the DAG right after - especially when you have a distributed filesystem or Git-Sync that did not manage to sync it yet.
   
   I think this should go under something along the lines of "Triggering DAGs after changes". I have to fix some docs problems anyway, so I can add something.





[GitHub] [airflow] uranusjr commented on pull request #17963: Improve the description of how to handle dynamic task generation

Posted by GitBox <gi...@apache.org>.
uranusjr commented on pull request #17963:
URL: https://github.com/apache/airflow/pull/17963#issuecomment-913446065


   @BasPH Any further comments?





[GitHub] [airflow] SamWheating commented on a change in pull request #17963: Improve the description of how to handle dynamic task generation

Posted by GitBox <gi...@apache.org>.
SamWheating commented on a change in pull request #17963:
URL: https://github.com/apache/airflow/pull/17963#discussion_r700664148



##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,63 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
 
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
 :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.

Review comment:
       ```suggestion
   runs every time Airflow parses an eligible python file, which happens at the maximum frequency of once every
   :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.
   ```
   
   I think that the minimum frequency is unbounded, while the maximum parsing frequency is determined by `min_file_process_interval`.
   



