Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/04/04 09:41:44 UTC

[GitHub] [airflow] eladkal commented on a change in pull request #15183: Move common pitfall documentation to Airflow docs

eladkal commented on a change in pull request #15183:
URL: https://github.com/apache/airflow/pull/15183#discussion_r606774296



##########
File path: docs/apache-airflow/common-pitfall.rst
##########
@@ -0,0 +1,202 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+
+Common Pitfalls
+===============
+
+Airflow Configurations
+^^^^^^^^^^^^^^^^^^^^^^
+
+Configuring parallelism
+-----------------------
+
+These configurations are executor agnostic.
+
+- :ref:`config:core__parallelism`
+
+  The maximum number of task instances that Airflow can run concurrently.
+  This usually corresponds to the number of task instances in the
+  ``running`` state in the metadata database.
+
+- :ref:`config:core__dag_concurrency`
+
+  The maximum number of task instances allowed to run concurrently
+  per DAG. To calculate the number of tasks that are running concurrently
+  for a DAG, add up the running tasks across all DAG runs of that DAG.
+  This is configurable at the DAG level with ``concurrency``.
+
+- :ref:`config:core__max_active_runs_per_dag`
+
+  The maximum number of active DAG runs per DAG. The scheduler will not
+  create more DAG runs once it reaches the limit. This is configurable at
+  the DAG level with ``max_active_runs``, as shown in the sketch after this list.
+
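+To see how the DAG-level knobs work together, here is a minimal, hypothetical
+sketch (the ``dag_id``, date and limits are made up):
+
+.. code-block:: python
+
+        from datetime import datetime
+
+        from airflow import DAG
+
+        dag = DAG(
+            dag_id='example_dag',
+            start_date=datetime(2021, 1, 1),
+            # At most 10 of this DAG's task instances may run at once,
+            # summed across all of its active DAG runs.
+            concurrency=10,
+            # The scheduler keeps at most 2 runs of this DAG active at a time.
+            max_active_runs=2,
+        )
+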
+DAG Structure and DAG Parameters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Multiple DAG definitions per Python file
+----------------------------------------
+
+Airflow does support more than one DAG definition per Python file, but it is not recommended: Airflow aims for
+better isolation between DAGs from a fault and deployment perspective, and multiple DAGs in the same file work
+against that. For now, make sure that each DAG object is in the global namespace for it to be recognized by Airflow.
+
+.. code-block:: python
+
+        globals()[dag_id] = DAG(...)
+
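+A more complete, hypothetical sketch of generating several DAGs from one file
+(the DAG ids, schedule and task here are placeholders):
+
+.. code-block:: python
+
+        from datetime import datetime
+
+        from airflow import DAG
+        from airflow.operators.dummy import DummyOperator
+
+        for dag_id in ['dag_a', 'dag_b']:
+            dag = DAG(
+                dag_id=dag_id,
+                start_date=datetime(2021, 1, 1),
+                schedule_interval='@daily',
+            )
+            with dag:
+                DummyOperator(task_id='do_nothing')
+            # Expose each DAG in the global namespace so Airflow recognizes it.
+            globals()[dag_id] = dag
+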
+Refer to :ref:`how to build dynamic DAGs<faq:dynamic_dag>`.
+
+Top level Python code
+----------------------
+
+While it is not recommended to write any code outside of defining Airflow constructs, Airflow does support
+arbitrary Python code in a DAG file as long as it does not break the DAG file processor or prolong file
+processing past the :ref:`config:core__dagbag_import_timeout` value.
+
+A common example is violating this time limit when building a dynamic DAG, which usually requires querying data
+from another service such as a database. Because every DAG file processor issues the query each time it parses
+the file, the queried service can be swamped with requests. These unintended interactions may degrade the
+service and eventually cause DAG file processing to fail, as illustrated below.
+
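+As a hypothetical sketch (the connection id and query are made up, and it
+assumes the Postgres provider is installed), compare a query issued at the top
+level of the file with one deferred into a task callable:
+
+.. code-block:: python
+
+        from airflow.providers.postgres.hooks.postgres import PostgresHook
+
+        # Bad: this query runs on every parse of the file, by every DAG file
+        # processor, regardless of whether any task is being executed.
+        # configs = PostgresHook(postgres_conn_id='my_db').get_records('SELECT * FROM dag_configs')
+
+        def fetch_configs():
+            # Better: the query only runs when a task invoking this callable executes.
+            return PostgresHook(postgres_conn_id='my_db').get_records('SELECT * FROM dag_configs')
+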
+Refer to :ref:`DAG writing best practices<best_practice:writing_a_dag>` for more information.
+
+Double Jinja Templating
+-----------------------
+
+It is not possible to render a Jinja template within another Jinja template. This is commonly attempted in
+``user_defined_macros``.
+
+.. code-block:: python
+
+        dag = DAG(
+            ...
+            user_defined_macros={
+                'my_custom_macro': 'day={{ ds }}'
+            }
+        )
+
+        bo = BashOperator(
+            task_id='my_task',
+            bash_command="echo {{ my_custom_macro }}",
+            dag=dag
+        )
+
+This will echo "day={{ ds }}" instead of "day=2020-01-01" for a DAG run with the execution date 2020-01-01 00:00:00.
+
+.. code-block:: python
+
+        bo = BashOperator(
+            task_id='my_task',
+            bash_command="echo day={{ ds }}",
+            dag=dag
+        )
+
+By using the ``ds`` macro directly in the templated field, the rendered value becomes "day=2020-01-01".
+
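+If you need a reusable macro that incorporates a runtime value, one workaround
+(a hypothetical sketch) is to make the macro a callable and pass ``ds`` to it
+at render time, so no nested rendering is needed:
+
+.. code-block:: python
+
+        dag = DAG(
+            ...
+            user_defined_macros={
+                # A plain function, called by Jinja at render time with the real ds value.
+                'my_custom_macro': lambda ds: f'day={ds}'
+            }
+        )
+
+        bo = BashOperator(
+            task_id='my_task',
+            bash_command="echo {{ my_custom_macro(ds) }}",
+            dag=dag
+        )
+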
+Operators and Hooks
+^^^^^^^^^^^^^^^^^^^
+
+File templating and file extensions
+-----------------------------------
+
+``TemplateNotFound`` errors are usually due to a misalignment with user expectations when passing values that
+trigger Jinja templating to an operator. A common occurrence is with ``BashOperator``.
+
+Because ``BashOperator``'s ``template_fields`` includes ``bash_command`` and its ``template_ext`` is a non-empty
+list, Airflow will treat the parameter value as a file path and render ``bash_command`` with that file's contents
+whenever the value ends in one of the listed file extensions.
+
+.. code-block:: python
+
+        bo = BashOperator(
+            task_id='my_script',
+            bash_command="/usr/local/airflow/include/test.sh",
+            dag=dag
+        )
+
+If you wish to execute a bash script directly, you need to add a space after the script name to prevent Airflow
+from treating the value as a path to a template file and rendering it with Jinja.
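+
+For example, a minimal sketch reusing the script path from above; note the
+trailing space after ``.sh``:
+
+.. code-block:: python
+
+        bo = BashOperator(
+            task_id='my_script',
+            # The trailing space means the value no longer ends in '.sh',
+            # so it is executed as a command instead of loaded as a template file.
+            bash_command="/usr/local/airflow/include/test.sh ",
+            dag=dag
+        )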

Review comment:
       This is duplicate content. It's explained in https://github.com/apache/airflow/blob/master/docs/apache-airflow/howto/operator/bash.rst#jinja-template-not-found
   
   You can create a ref link to `bash.rst` instead.

##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -54,6 +59,12 @@ Some of the ways you can avoid producing a different result -
     You should define repetitive parameters such as ``connection_id`` or S3 paths in ``default_args`` rather than declaring them for each task.
     The ``default_args`` help to avoid mistakes such as typographical errors.
 
+Creating a custom Operator
+---------------------------
+
+When implementing custom operators, do not perform any expensive operations in their ``__init__``. Operators
+are instantiated once per scheduler run per task using them, so making database calls there can significantly
+slow down scheduling and waste resources. A sketch of this guideline follows.
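+
+A hypothetical sketch (the operator name, connection id and query are made up,
+and it assumes the Postgres provider is installed): keep ``__init__`` cheap and
+defer expensive work to ``execute``:
+
+.. code-block:: python
+
+        from airflow.models import BaseOperator
+        from airflow.providers.postgres.hooks.postgres import PostgresHook
+
+        class MyQueryOperator(BaseOperator):
+
+            def __init__(self, *, conn_id: str, sql: str, **kwargs):
+                super().__init__(**kwargs)
+                # Cheap: only store parameters. No connections or queries here,
+                # because __init__ runs every time the scheduler parses the DAG file.
+                self.conn_id = conn_id
+                self.sql = sql
+
+            def execute(self, context):
+                # Expensive work happens only when the task actually runs.
+                hook = PostgresHook(postgres_conn_id=self.conn_id)
+                return hook.get_records(self.sql)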

Review comment:
       This is probably something we need to mention in https://github.com/apache/airflow/blob/master/docs/apache-airflow/howto/custom-operator.rst#creating-a-custom-operator and create a ref link there.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org