You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/08/17 23:50:21 UTC

[GitHub] [airflow] potiuk opened a new pull request, #25780: Implement ExternalPythonOperator

potiuk opened a new pull request, #25780:
URL: https://github.com/apache/airflow/pull/25780

   This Operator works very similarly to PythonVirtualenvOperator - but
   instead of creating a virtualenv dynamically, it expects the
   env to be available in the environment that Airlfow is run in.
   
   <!--
   Thank you for contributing! Please make sure that your code changes
   are covered with tests. And in case of new features or big changes
   remember to adjust the documentation.
   
   Feel free to ping committers for the review!
   
   In case of an existing issue, reference it using one of the following:
   
   closes: #ISSUE
   related: #ISSUE
   
   How to write a good git commit message:
   http://chris.beams.io/posts/git-commit/
   -->
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#pull-request-guidelines)** for more information.
   In case of fundamental code changes, an Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in a newsfragment file, named `{pr_number}.significant.rst` or `{issue_number}.significant.rst`, in [newsfragments](https://github.com/apache/airflow/tree/main/newsfragments).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] o-nikolas commented on pull request #25780: Implement PythonOtherenvOperator

Posted by GitBox <gi...@apache.org>.
o-nikolas commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1229022042

    > I think I figured out a good proposal that fits very well both - requirements of being similar to virtualenv and being "correct" in terms of not referring to virtualenv.
   > 
   > My proposal (and it's already updated in the PR) is:
   > 
   >     * PythonOtherenvOperator
   > 
   >     * @task.otherenv decorator
   > 
   > 
   > I think this addresses all the concerns, it is short, easy to remember and use and also has very close resemblance to PythonVirtualenvOperator to show that it is closer to it than to PythonOperator and it does not imply Virtualenv.
   > 
   > Few doubts I had (and I made some choices that could be changed still):
   > 
   >     * PythonOtherEnvOperator vs PythonOtherenvOperator -> I think the latter is better even if slightly less "correct" - it also matches well the decorator (we have no casing in decorator by convention)
   > 
   >     * @task.python_otherenv vs. @task.otherenv  -> I think the latter is better: shorter and more close to @task.virtualenv too.
   > 
   >     * still there is one reference to virtualenv - we have `virtualenv_string_args` still created as 'global' variable accessible in the task. Changing it would be backwards-incompatible, and I think it's not worth to handle it differently.
   > 
   > 
   > Let me know what you think @o-nikolas @uranusjr @ashb. Does it look `good-enough` for all of you :)
   
   I think "Other" is a bit too vague. I personally prefer a name be a bit on the longer side and then be very declarative/clear, rather than saving a few characters and it becoming terse and vague. Maybe dropping the "Pre" from the last name would be a compromise: `PythonExistingVirtualenvOperator`? Even, though I personally like "Preexisting" (it's just three more letters long and it is  __very__ clear what the meaning is) I think "Existing" could be a good middle ground.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1234179026

   > would it make sense to add a disclaimer/warning in the case this operator is run in KubernetesExecutor or kubernetesCeleryExecutor (k8S queue) ?
   
   What disclaimer? It shoudl work, there are no limits for that. It's perfectly fine to run the operator with KubernetesExecutor. I can easily imagine this being used when you have just one single image with multiple predefined envs and you want to choose which one to use.
   
   What problems do you foresee with that @raphaelauv  ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] raphaelauv commented on pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
raphaelauv commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1234175667

   would it make sense to add a disclaimer/warning in the case this operator is run in KubernetesExecutor or kubernetesCeleryExecutor (k8S  queue) ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r948536082


##########
airflow/utils/decorators.py:
##########
@@ -50,3 +51,33 @@ def wrapper(*args, **kwargs):
         return func(*args, **kwargs)
 
     return cast(T, wrapper)
+

Review Comment:
   Moved it here because it should be here - it was not moved from 'venv' as itis now used in more places (I left an import for back-compat there).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r948876982


##########
airflow/operators/python.py:
##########
@@ -176,7 +178,7 @@ def execute(self, context: Context) -> Any:
 
         return return_value
 
-    def determine_kwargs(self, context: Mapping[str, Any]) -> Mapping[str, Any]:
+    def determine_kwargs(self, context: MutableMapping[str, Any]) -> MutableMapping[str, Any]:

Review Comment:
   Yep. This works:
   ````
           op_kwargs: Dict[str, Any] = dict()
           op_kwargs.update(self.op_kwargs)
           if self.templates_dict:
               op_kwargs['templates_dict'] = self.templates_dict
   
   ```



##########
airflow/operators/python.py:
##########
@@ -176,7 +178,7 @@ def execute(self, context: Context) -> Any:
 
         return return_value
 
-    def determine_kwargs(self, context: Mapping[str, Any]) -> Mapping[str, Any]:
+    def determine_kwargs(self, context: MutableMapping[str, Any]) -> MutableMapping[str, Any]:

Review Comment:
   Yep. This works:
   ```
           op_kwargs: Dict[str, Any] = dict()
           op_kwargs.update(self.op_kwargs)
           if self.templates_dict:
               op_kwargs['templates_dict'] = self.templates_dict
   
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] ashb commented on pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
ashb commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1228312031

   (Sorry, been on a short break this week)
   
   I have a question about the name, specifically the "Virtualenv" part: Is this specific to a virtual environment? Wouldn't it also work with, for example, a Python2 binary?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1235736013

   :D ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r950703427


##########
airflow/example_dags/example_python_operator.py:
##########
@@ -93,3 +93,28 @@ def callable_virtualenv():
 
         virtualenv_task = callable_virtualenv()
         # [END howto_operator_python_venv]
+
+        # [START howto_operator_external_python]
+        @task.external_python(task_id="virtualenv_python", python="/ven/bin/python")
+        def callable_external_python():
+            """
+            Example function that will be performed in a virtual environment.
+
+            Importing at the module level ensures that it will not attempt to import the
+            library before it is installed.
+            """
+            from time import sleep
+
+            from colorama import Back, Fore, Style
+
+            print(Fore.RED + 'some red text')
+            print(Back.GREEN + 'and with a green background')
+            print(Style.DIM + 'and in dim text')
+            print(Style.RESET_ALL)
+            for _ in range(10):
+                print(Style.DIM + 'Please wait...', flush=True)
+                sleep(10)

Review Comment:
   Yeah. Can be shorter decreased it . But Actually those are part of the docs and they are displayed there ar, so extracting common util with cloud their role as an example. In this case DAMP is better than DRY and for a good reason.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r948876982


##########
airflow/operators/python.py:
##########
@@ -176,7 +178,7 @@ def execute(self, context: Context) -> Any:
 
         return return_value
 
-    def determine_kwargs(self, context: Mapping[str, Any]) -> Mapping[str, Any]:
+    def determine_kwargs(self, context: MutableMapping[str, Any]) -> MutableMapping[str, Any]:

Review Comment:
   Yep. This works:
   ```
           op_kwargs: Dict[str, Any] = dict()
           op_kwargs.update(self.op_kwargs)
           if self.templates_dict:
               op_kwargs['templates_dict'] = self.templates_dict
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r948977015


##########
airflow/operators/python.py:
##########
@@ -176,7 +178,7 @@ def execute(self, context: Context) -> Any:
 
         return return_value
 
-    def determine_kwargs(self, context: Mapping[str, Any]) -> Mapping[str, Any]:
+    def determine_kwargs(self, context: MutableMapping[str, Any]) -> MutableMapping[str, Any]:

Review Comment:
   Actualy it's both - @uranusjr. I just have a case where `--all-files` produces different result than just changes from this commit, so likely the hypothesis of having to add .pyi is right. Will experiment a bit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r949022510


##########
airflow/decorators/__init__.pyi:
##########
@@ -124,6 +126,40 @@ class TaskDecoratorCollection:
     @overload
     def virtualenv(self, python_callable: Callable[FParams, FReturn]) -> Task[FParams, FReturn]: ...
     @overload
+    def external_python(
+        self,
+        *,
+        python_fspath: str = None,
+        multiple_outputs: Optional[bool] = None,
+        # 'python_callable', 'op_args' and 'op_kwargs' since they are filled by
+        # _PythonVirtualenvDecoratedOperator.
+        use_dill: bool = False,
+        templates_dict: Optional[Mapping[str, Any]] = None,
+        show_return_value_in_logs: bool = True,
+        **kwargs,
+    ) -> TaskDecorator:
+        """Create a decorator to convert the decorated callable to a virtual environment task.
+
+        :param python_fspath: Full time path string (file-system specific) that points to a Python binary inside
+            a virtualenv that should be used (in ``VENV/bin`` folder). Should be absolute path
+            (so usually start with "/" or "X:/" depending on the filesystem/os used).
+        :param multiple_outputs: If set, function return value will be unrolled to multiple XCom values.
+            Dict will unroll to XCom values with keys as XCom keys. Defaults to False.
+        :param use_dill: Whether to use dill to serialize
+            the args and result (pickle is default). This allow more complex types
+            but requires you to include dill in your requirements.
+        :param templates_dict: a dictionary where the values are templates that
+            will get templated by the Airflow engine sometime between
+            ``__init__`` and ``execute`` takes place and are made available
+            in your callable's context after the template has been applied.
+        :param show_return_value_in_logs: a bool value whether to show return_value
+            logs. Defaults to True, which allows return value log output.
+            It can be set to False to prevent log output of return value when you return huge data
+            such as transmission a large amount of XCom to TaskAPI.
+        """
+    @overload
+    def external_python(self, python_callable: Callable[FParams, FReturn]) -> Task[FParams, FReturn]: ...

Review Comment:
   Right



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] o-nikolas commented on a diff in pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
o-nikolas commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r951868224


##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your

Review Comment:
   Ah yes, that extra little bit of context makes this clear now. Thanks



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r948977015


##########
airflow/operators/python.py:
##########
@@ -176,7 +178,7 @@ def execute(self, context: Context) -> Any:
 
         return return_value
 
-    def determine_kwargs(self, context: Mapping[str, Any]) -> Mapping[str, Any]:
+    def determine_kwargs(self, context: MutableMapping[str, Any]) -> MutableMapping[str, Any]:

Review Comment:
   Actualy it's both - @uranusjr. I just have a case where `--all-files` produces different result (success) than just changes from this commit, so likely the hypothesis of having to add .pyi is right. Will experiment a bit.
   
   This are the errors I get when I only run mypy on the changed files in this commit:
   
   ```
   airflow/operators/python.py:277: error: Argument 3 to "skip" of "SkipMixin" has
   incompatible type "Collection[Union[BaseOperator, MappedOperator]]"; expected
   "Sequence[BaseOperator]"  [arg-type]
                       self.skip(dag_run, execution_date, downstream_tasks)
                                                          ^
   airflow/operators/python.py:282: error: Argument 3 to "skip" of "SkipMixin" has
   incompatible type "Iterable[DAGNode]"; expected "Sequence[BaseOperator]"
   [arg-type]
       ...              self.skip(dag_run, execution_date, context["task"].get_d...
                                                           ^
   Found 2 errors in 1 file (checked 1 source file)
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r956376050


##########
airflow/decorators/__init__.pyi:
##########
@@ -41,6 +42,7 @@ __all__ = [
     "task_group",
     "python_task",
     "virtualenv_task",
+    "preexisting_virtualenv_task",

Review Comment:
   I think I found a better name. Stay tuned.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk merged pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk merged PR #25780:
URL: https://github.com/apache/airflow/pull/25780


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1227699643

   Would love to get it merged before 2.4 :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] o-nikolas commented on a diff in pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
o-nikolas commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r951866235


##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+

Review Comment:
   No worries at all! I only speak one language so I can't judge anyone else :smile: 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r951849964


##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+

Review Comment:
   Thanks @o-nikolas :). English Grammar is not my strongest skill :). We don't have articles in Polish at all, that's why I almost never know when to use a/the :). I really appreciate you taking time to look at it and correct it!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] o-nikolas commented on a diff in pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
o-nikolas commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r951810315


##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installation those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and ``PreexistingPythonVirtualenvOperator``
+as counterparts - as DAG author you'd normally iterate with dependencies and develop your DAG using
+``PythonVirtualenvOperator`` (thus decorating your tasks with ``@task.virtualenv`` decorators, while
+after the iteration and changes you would likely want to change it for production to switch to
+the ``PreexistingPythonVirtualenvOperator`` after your DevOps/System Admin teams deploy your new
+virtualenv to production. The nice thing about this is that you can switch the decorator back
+at any time and continue developing it "dynamically" with ``PythonVirtualenvOperator``.
+
+
+Using DockerOperator or Kubernetes Pod Operator
+-----------------------------------------------
+
+Another strategy is to use Docker Operator or Kubernetes Pod Operator. Those require that Airflow runs in
+Docker container environment or Kubernetes environment (or at the very least have access to create and
+run tasks with those.
+
+Similarly as in case of Python operators, the taskflow decorators are handy for you if you would like to
+use those operators to execute your callable Python code.
+
+It is far more involved - you need to understand how Docker/Kubernetes Pods work if you want to use

Review Comment:
   ```suggestion
   However, it is far more involved - you need to understand how Docker/Kubernetes Pods work if you want to use
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installation those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and ``PreexistingPythonVirtualenvOperator``
+as counterparts - as DAG author you'd normally iterate with dependencies and develop your DAG using
+``PythonVirtualenvOperator`` (thus decorating your tasks with ``@task.virtualenv`` decorators, while
+after the iteration and changes you would likely want to change it for production to switch to
+the ``PreexistingPythonVirtualenvOperator`` after your DevOps/System Admin teams deploy your new
+virtualenv to production. The nice thing about this is that you can switch the decorator back
+at any time and continue developing it "dynamically" with ``PythonVirtualenvOperator``.
+
+
+Using DockerOperator or Kubernetes Pod Operator
+-----------------------------------------------
+
+Another strategy is to use Docker Operator or Kubernetes Pod Operator. Those require that Airflow runs in
+Docker container environment or Kubernetes environment (or at the very least have access to create and
+run tasks with those.

Review Comment:
   ```suggestion
   run tasks with those).
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installation those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and ``PreexistingPythonVirtualenvOperator``
+as counterparts - as DAG author you'd normally iterate with dependencies and develop your DAG using
+``PythonVirtualenvOperator`` (thus decorating your tasks with ``@task.virtualenv`` decorators, while
+after the iteration and changes you would likely want to change it for production to switch to
+the ``PreexistingPythonVirtualenvOperator`` after your DevOps/System Admin teams deploy your new

Review Comment:
   ```suggestion
   the ``PythonPreexistingVirtualenvOperator`` after your DevOps/System Admin teams deploy your new
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installation those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and ``PreexistingPythonVirtualenvOperator``
+as counterparts - as DAG author you'd normally iterate with dependencies and develop your DAG using
+``PythonVirtualenvOperator`` (thus decorating your tasks with ``@task.virtualenv`` decorators, while
+after the iteration and changes you would likely want to change it for production to switch to
+the ``PreexistingPythonVirtualenvOperator`` after your DevOps/System Admin teams deploy your new
+virtualenv to production. The nice thing about this is that you can switch the decorator back
+at any time and continue developing it "dynamically" with ``PythonVirtualenvOperator``.
+
+
+Using DockerOperator or Kubernetes Pod Operator
+-----------------------------------------------
+
+Another strategy is to use Docker Operator or Kubernetes Pod Operator. Those require that Airflow runs in
+Docker container environment or Kubernetes environment (or at the very least have access to create and
+run tasks with those.
+
+Similarly as in case of Python operators, the taskflow decorators are handy for you if you would like to
+use those operators to execute your callable Python code.
+
+It is far more involved - you need to understand how Docker/Kubernetes Pods work if you want to use
+this approach, but the tasks are fully isolated from each other and you are not even limited to running
+Python code. You can write your tasks in any Programming language you want. Also your dependencies are
+fully independent from Airflow ones (including the system level dependencies) so if your task require
+very different environment, this is the way to go. Those are ``@task.docker`` and ``@task.kubernetes``
+decorators.
+
+The benefits of those operators are:
+
+* You can run tasks with different sets of both Python and system level dependencies, or even tasks
+  written in completely different language or even different processor architecture (x86 vs. arm).
+* The environment used to run the tasks enjoys the optimizations and immutability of containers, where
+  similar set of dependencies can effectively reuse a number of cached layers of the image, so the
+  environment is optimized for the case where you have multiple similar, but different environments.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Complete isolation between tasks. They cannot influence one another in other ways than using standard
+  Airflow XCom mechanisms.
+
+The drawbacks:
+
+* There is an overhead to start the tasks. Usually not as big as when creating virtual environments dynamically,
+  but still significant (especially for Kubernetes Pod Operator).
+* Resource re-use is still OK but a little less fine grained than in case of running task via virtual environment.
+  There is an overhead that each running container and Pod introduce, depending on your deployment, but it is
+  generally higher than when running virtual environment task. Also, there is somewhat duplication of resources used.

Review Comment:
   ```suggestion
     generally higher than when running a virtual environment task. Also, the resources used are somewhat duplicated.
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installation those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and ``PreexistingPythonVirtualenvOperator``
+as counterparts - as DAG author you'd normally iterate with dependencies and develop your DAG using
+``PythonVirtualenvOperator`` (thus decorating your tasks with ``@task.virtualenv`` decorators, while
+after the iteration and changes you would likely want to change it for production to switch to
+the ``PreexistingPythonVirtualenvOperator`` after your DevOps/System Admin teams deploy your new
+virtualenv to production. The nice thing about this is that you can switch the decorator back
+at any time and continue developing it "dynamically" with ``PythonVirtualenvOperator``.
+
+
+Using DockerOperator or Kubernetes Pod Operator
+-----------------------------------------------
+
+Another strategy is to use Docker Operator or Kubernetes Pod Operator. Those require that Airflow runs in
+Docker container environment or Kubernetes environment (or at the very least have access to create and
+run tasks with those.
+
+Similarly as in case of Python operators, the taskflow decorators are handy for you if you would like to
+use those operators to execute your callable Python code.
+
+It is far more involved - you need to understand how Docker/Kubernetes Pods work if you want to use
+this approach, but the tasks are fully isolated from each other and you are not even limited to running
+Python code. You can write your tasks in any Programming language you want. Also your dependencies are
+fully independent from Airflow ones (including the system level dependencies) so if your task require
+very different environment, this is the way to go. Those are ``@task.docker`` and ``@task.kubernetes``
+decorators.
+
+The benefits of those operators are:
+
+* You can run tasks with different sets of both Python and system level dependencies, or even tasks
+  written in completely different language or even different processor architecture (x86 vs. arm).
+* The environment used to run the tasks enjoys the optimizations and immutability of containers, where
+  similar set of dependencies can effectively reuse a number of cached layers of the image, so the
+  environment is optimized for the case where you have multiple similar, but different environments.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Complete isolation between tasks. They cannot influence one another in other ways than using standard
+  Airflow XCom mechanisms.
+
+The drawbacks:
+
+* There is an overhead to start the tasks. Usually not as big as when creating virtual environments dynamically,
+  but still significant (especially for Kubernetes Pod Operator).
+* Resource re-use is still OK but a little less fine grained than in case of running task via virtual environment.

Review Comment:
   ```suggestion
   * Resource re-use is still OK but a little less fine grained than in the case of running task via virtual environment.
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installation those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and ``PreexistingPythonVirtualenvOperator``
+as counterparts - as DAG author you'd normally iterate with dependencies and develop your DAG using
+``PythonVirtualenvOperator`` (thus decorating your tasks with ``@task.virtualenv`` decorators, while
+after the iteration and changes you would likely want to change it for production to switch to
+the ``PreexistingPythonVirtualenvOperator`` after your DevOps/System Admin teams deploy your new
+virtualenv to production. The nice thing about this is that you can switch the decorator back
+at any time and continue developing it "dynamically" with ``PythonVirtualenvOperator``.
+
+
+Using DockerOperator or Kubernetes Pod Operator
+-----------------------------------------------
+
+Another strategy is to use Docker Operator or Kubernetes Pod Operator. Those require that Airflow runs in
+Docker container environment or Kubernetes environment (or at the very least have access to create and
+run tasks with those.
+
+Similarly as in case of Python operators, the taskflow decorators are handy for you if you would like to
+use those operators to execute your callable Python code.
+
+It is far more involved - you need to understand how Docker/Kubernetes Pods work if you want to use
+this approach, but the tasks are fully isolated from each other and you are not even limited to running
+Python code. You can write your tasks in any Programming language you want. Also your dependencies are
+fully independent from Airflow ones (including the system level dependencies) so if your task require
+very different environment, this is the way to go. Those are ``@task.docker`` and ``@task.kubernetes``
+decorators.
+
+The benefits of those operators are:
+
+* You can run tasks with different sets of both Python and system level dependencies, or even tasks
+  written in completely different language or even different processor architecture (x86 vs. arm).
+* The environment used to run the tasks enjoys the optimizations and immutability of containers, where

Review Comment:
   ```suggestion
   * The environment used to run the tasks enjoys the optimizations and immutability of containers, where a
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installation those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and ``PreexistingPythonVirtualenvOperator``
+as counterparts - as DAG author you'd normally iterate with dependencies and develop your DAG using
+``PythonVirtualenvOperator`` (thus decorating your tasks with ``@task.virtualenv`` decorators, while
+after the iteration and changes you would likely want to change it for production to switch to
+the ``PreexistingPythonVirtualenvOperator`` after your DevOps/System Admin teams deploy your new
+virtualenv to production. The nice thing about this is that you can switch the decorator back
+at any time and continue developing it "dynamically" with ``PythonVirtualenvOperator``.
+
+
+Using DockerOperator or Kubernetes Pod Operator
+-----------------------------------------------
+
+Another strategy is to use Docker Operator or Kubernetes Pod Operator. Those require that Airflow runs in

Review Comment:
   ```suggestion
   Another strategy is to use the Docker Operator or the Kubernetes Pod Operator. Those require that Airflow runs in a
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installation those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and ``PreexistingPythonVirtualenvOperator``
+as counterparts - as DAG author you'd normally iterate with dependencies and develop your DAG using
+``PythonVirtualenvOperator`` (thus decorating your tasks with ``@task.virtualenv`` decorators, while
+after the iteration and changes you would likely want to change it for production to switch to
+the ``PreexistingPythonVirtualenvOperator`` after your DevOps/System Admin teams deploy your new
+virtualenv to production. The nice thing about this is that you can switch the decorator back
+at any time and continue developing it "dynamically" with ``PythonVirtualenvOperator``.
+
+
+Using DockerOperator or Kubernetes Pod Operator
+-----------------------------------------------
+
+Another strategy is to use Docker Operator or Kubernetes Pod Operator. Those require that Airflow runs in
+Docker container environment or Kubernetes environment (or at the very least have access to create and
+run tasks with those.
+
+Similarly as in case of Python operators, the taskflow decorators are handy for you if you would like to
+use those operators to execute your callable Python code.
+
+It is far more involved - you need to understand how Docker/Kubernetes Pods work if you want to use
+this approach, but the tasks are fully isolated from each other and you are not even limited to running
+Python code. You can write your tasks in any Programming language you want. Also your dependencies are
+fully independent from Airflow ones (including the system level dependencies) so if your task require
+very different environment, this is the way to go. Those are ``@task.docker`` and ``@task.kubernetes``
+decorators.
+
+The benefits of those operators are:
+
+* You can run tasks with different sets of both Python and system level dependencies, or even tasks
+  written in completely different language or even different processor architecture (x86 vs. arm).
+* The environment used to run the tasks enjoys the optimizations and immutability of containers, where
+  similar set of dependencies can effectively reuse a number of cached layers of the image, so the
+  environment is optimized for the case where you have multiple similar, but different environments.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Complete isolation between tasks. They cannot influence one another in other ways than using standard
+  Airflow XCom mechanisms.
+
+The drawbacks:
+
+* There is an overhead to start the tasks. Usually not as big as when creating virtual environments dynamically,
+  but still significant (especially for Kubernetes Pod Operator).
+* Resource re-use is still OK but a little less fine grained than in case of running task via virtual environment.
+  There is an overhead that each running container and Pod introduce, depending on your deployment, but it is
+  generally higher than when running virtual environment task. Also, there is somewhat duplication of resources used.
+  In case of both Docker and Kubernetes operator, running tasks requires always at least two processes - one
+  process (running in Docker Container or Kubernetes Pod) executing the task, and supervising Airflow
+  Python task that submits the job to Docker/Kubernetes and monitors it's execution.
+* Your environment needs to have the container images prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be much longer. Appropriate deployment pipeline here
+  is a must to be able to reliably maintain your deployment.
+* Your python callable has to be serializable if you want to run it via decorators, also in this case
+  all dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* You need to understand more details about how Docker Containers or Kubernetes work. The abstraction
+  provided by those two are "leaky", so you need to understand a bit more about resources, networking,
+  containers etc. in order to Author DAG that uses those operators.
+
+
+Using multiple Docker Images and Celery Queues
+----------------------------------------------
+
+There is a possibility (though it requires a deep knowledge of Airflow deployment) to run Airflow tasks
+using multiple, independent Docker images. This can be achieved via allocating different tasks to different
+Queues and configuring your Celery workers to use different images for different Queues. This however
+(at least currently) requires a lot of manual deployment configuration and intrinsic knowledge of how
+Airflow, Celery and Kubernetes works. Also it introduce quite some overhead for running the tasks - there
+are less chances for resource reuse and it's much more difficult to fine-tune such a deployment for
+cost of resources without impacting the performance and stability.
+
+One of the possible ways to make it more useful is
+`AIP-46 Runtime isolation for Airflow tasks and DAG parsing <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-46+Runtime+isolation+for+airflow+tasks+and+dag+parsing>`_.
+and completion of `AIP-43 DAG Processor Separation <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-43+DAG+Processor+separation>`_
+Until those are implemented, there are very little benefits of using this approach and it is not recommended.
+
+When those AIPs are implemented, however, this will open up the possibility of more multi-tenant approach,
+where multiple teams will be able to have completely isolated set of dependencies that will be used across
+all the lifecycle of DAG execution - from parsing to execution.

Review Comment:
   ```suggestion
   the full lifecycle of a DAG - from parsing to execution.
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installation those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and ``PreexistingPythonVirtualenvOperator``
+as counterparts - as DAG author you'd normally iterate with dependencies and develop your DAG using
+``PythonVirtualenvOperator`` (thus decorating your tasks with ``@task.virtualenv`` decorators, while
+after the iteration and changes you would likely want to change it for production to switch to
+the ``PreexistingPythonVirtualenvOperator`` after your DevOps/System Admin teams deploy your new
+virtualenv to production. The nice thing about this is that you can switch the decorator back
+at any time and continue developing it "dynamically" with ``PythonVirtualenvOperator``.
+
+
+Using DockerOperator or Kubernetes Pod Operator
+-----------------------------------------------
+
+Another strategy is to use Docker Operator or Kubernetes Pod Operator. Those require that Airflow runs in
+Docker container environment or Kubernetes environment (or at the very least have access to create and
+run tasks with those.
+
+Similarly as in case of Python operators, the taskflow decorators are handy for you if you would like to
+use those operators to execute your callable Python code.
+
+It is far more involved - you need to understand how Docker/Kubernetes Pods work if you want to use
+this approach, but the tasks are fully isolated from each other and you are not even limited to running
+Python code. You can write your tasks in any Programming language you want. Also your dependencies are
+fully independent from Airflow ones (including the system level dependencies) so if your task require
+very different environment, this is the way to go. Those are ``@task.docker`` and ``@task.kubernetes``
+decorators.
+
+The benefits of those operators are:
+
+* You can run tasks with different sets of both Python and system level dependencies, or even tasks
+  written in completely different language or even different processor architecture (x86 vs. arm).
+* The environment used to run the tasks enjoys the optimizations and immutability of containers, where
+  similar set of dependencies can effectively reuse a number of cached layers of the image, so the
+  environment is optimized for the case where you have multiple similar, but different environments.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Complete isolation between tasks. They cannot influence one another in other ways than using standard
+  Airflow XCom mechanisms.
+
+The drawbacks:
+
+* There is an overhead to start the tasks. Usually not as big as when creating virtual environments dynamically,
+  but still significant (especially for Kubernetes Pod Operator).
+* Resource re-use is still OK but a little less fine grained than in case of running task via virtual environment.
+  There is an overhead that each running container and Pod introduce, depending on your deployment, but it is
+  generally higher than when running virtual environment task. Also, there is somewhat duplication of resources used.
+  In case of both Docker and Kubernetes operator, running tasks requires always at least two processes - one
+  process (running in Docker Container or Kubernetes Pod) executing the task, and supervising Airflow
+  Python task that submits the job to Docker/Kubernetes and monitors it's execution.
+* Your environment needs to have the container images prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be much longer. Appropriate deployment pipeline here
+  is a must to be able to reliably maintain your deployment.
+* Your python callable has to be serializable if you want to run it via decorators, also in this case
+  all dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* You need to understand more details about how Docker Containers or Kubernetes work. The abstraction
+  provided by those two are "leaky", so you need to understand a bit more about resources, networking,
+  containers etc. in order to Author DAG that uses those operators.
+
+
+Using multiple Docker Images and Celery Queues
+----------------------------------------------
+
+There is a possibility (though it requires a deep knowledge of Airflow deployment) to run Airflow tasks
+using multiple, independent Docker images. This can be achieved via allocating different tasks to different
+Queues and configuring your Celery workers to use different images for different Queues. This however
+(at least currently) requires a lot of manual deployment configuration and intrinsic knowledge of how
+Airflow, Celery and Kubernetes works. Also it introduce quite some overhead for running the tasks - there
+are less chances for resource reuse and it's much more difficult to fine-tune such a deployment for
+cost of resources without impacting the performance and stability.
+
+One of the possible ways to make it more useful is
+`AIP-46 Runtime isolation for Airflow tasks and DAG parsing <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-46+Runtime+isolation+for+airflow+tasks+and+dag+parsing>`_.
+and completion of `AIP-43 DAG Processor Separation <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-43+DAG+Processor+separation>`_
+Until those are implemented, there are very little benefits of using this approach and it is not recommended.

Review Comment:
   ```suggestion
   Until those are implemented, there are very few benefits of using this approach and it is not recommended.
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+

Review Comment:
   Very fantastic and detailed doc Jarek! :tada: 
   
   If read through entirely it should give folks who are new to Airflow a good summary of the different task execution options but also they will learn more about how the Airflow execution environment itself works (where dependencies are sourced from, how tasks can conflict on shared execution hosts, etc).
   
   It looks like a lot of comments below, but there was a lot of text to read through :) Most of the comments are just grammar/naming.



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installation those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and ``PreexistingPythonVirtualenvOperator``
+as counterparts - as DAG author you'd normally iterate with dependencies and develop your DAG using
+``PythonVirtualenvOperator`` (thus decorating your tasks with ``@task.virtualenv`` decorators, while
+after the iteration and changes you would likely want to change it for production to switch to
+the ``PreexistingPythonVirtualenvOperator`` after your DevOps/System Admin teams deploy your new
+virtualenv to production. The nice thing about this is that you can switch the decorator back
+at any time and continue developing it "dynamically" with ``PythonVirtualenvOperator``.
+
+
+Using DockerOperator or Kubernetes Pod Operator
+-----------------------------------------------
+
+Another strategy is to use Docker Operator or Kubernetes Pod Operator. Those require that Airflow runs in
+Docker container environment or Kubernetes environment (or at the very least have access to create and
+run tasks with those.
+
+Similarly as in case of Python operators, the taskflow decorators are handy for you if you would like to
+use those operators to execute your callable Python code.
+
+It is far more involved - you need to understand how Docker/Kubernetes Pods work if you want to use
+this approach, but the tasks are fully isolated from each other and you are not even limited to running
+Python code. You can write your tasks in any Programming language you want. Also your dependencies are
+fully independent from Airflow ones (including the system level dependencies) so if your task require
+very different environment, this is the way to go. Those are ``@task.docker`` and ``@task.kubernetes``
+decorators.
+
+The benefits of those operators are:
+
+* You can run tasks with different sets of both Python and system level dependencies, or even tasks
+  written in completely different language or even different processor architecture (x86 vs. arm).
+* The environment used to run the tasks enjoys the optimizations and immutability of containers, where
+  similar set of dependencies can effectively reuse a number of cached layers of the image, so the
+  environment is optimized for the case where you have multiple similar, but different environments.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Complete isolation between tasks. They cannot influence one another in other ways than using standard
+  Airflow XCom mechanisms.
+
+The drawbacks:
+
+* There is an overhead to start the tasks. Usually not as big as when creating virtual environments dynamically,
+  but still significant (especially for Kubernetes Pod Operator).
+* Resource re-use is still OK but a little less fine grained than in case of running task via virtual environment.
+  There is an overhead that each running container and Pod introduce, depending on your deployment, but it is
+  generally higher than when running virtual environment task. Also, there is somewhat duplication of resources used.
+  In case of both Docker and Kubernetes operator, running tasks requires always at least two processes - one
+  process (running in Docker Container or Kubernetes Pod) executing the task, and supervising Airflow
+  Python task that submits the job to Docker/Kubernetes and monitors it's execution.
+* Your environment needs to have the container images prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be much longer. Appropriate deployment pipeline here
+  is a must to be able to reliably maintain your deployment.
+* Your python callable has to be serializable if you want to run it via decorators, also in this case
+  all dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* You need to understand more details about how Docker Containers or Kubernetes work. The abstraction
+  provided by those two are "leaky", so you need to understand a bit more about resources, networking,
+  containers etc. in order to Author DAG that uses those operators.

Review Comment:
   ```suggestion
     containers etc. in order to author a DAG that uses those operators.
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installation those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and ``PreexistingPythonVirtualenvOperator``
+as counterparts - as DAG author you'd normally iterate with dependencies and develop your DAG using
+``PythonVirtualenvOperator`` (thus decorating your tasks with ``@task.virtualenv`` decorators, while
+after the iteration and changes you would likely want to change it for production to switch to
+the ``PreexistingPythonVirtualenvOperator`` after your DevOps/System Admin teams deploy your new
+virtualenv to production. The nice thing about this is that you can switch the decorator back
+at any time and continue developing it "dynamically" with ``PythonVirtualenvOperator``.
+
+
+Using DockerOperator or Kubernetes Pod Operator
+-----------------------------------------------
+
+Another strategy is to use Docker Operator or Kubernetes Pod Operator. Those require that Airflow runs in
+Docker container environment or Kubernetes environment (or at the very least have access to create and
+run tasks with those.
+
+Similarly as in case of Python operators, the taskflow decorators are handy for you if you would like to
+use those operators to execute your callable Python code.
+
+It is far more involved - you need to understand how Docker/Kubernetes Pods work if you want to use
+this approach, but the tasks are fully isolated from each other and you are not even limited to running
+Python code. You can write your tasks in any Programming language you want. Also your dependencies are
+fully independent from Airflow ones (including the system level dependencies) so if your task require
+very different environment, this is the way to go. Those are ``@task.docker`` and ``@task.kubernetes``

Review Comment:
   ```suggestion
   a very different environment, this is the way to go. Those are ``@task.docker`` and ``@task.kubernetes``
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installation those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and ``PreexistingPythonVirtualenvOperator``
+as counterparts - as DAG author you'd normally iterate with dependencies and develop your DAG using
+``PythonVirtualenvOperator`` (thus decorating your tasks with ``@task.virtualenv`` decorators, while
+after the iteration and changes you would likely want to change it for production to switch to
+the ``PreexistingPythonVirtualenvOperator`` after your DevOps/System Admin teams deploy your new
+virtualenv to production. The nice thing about this is that you can switch the decorator back
+at any time and continue developing it "dynamically" with ``PythonVirtualenvOperator``.
+
+
+Using DockerOperator or Kubernetes Pod Operator
+-----------------------------------------------
+
+Another strategy is to use Docker Operator or Kubernetes Pod Operator. Those require that Airflow runs in
+Docker container environment or Kubernetes environment (or at the very least have access to create and
+run tasks with those.
+
+Similarly as in case of Python operators, the taskflow decorators are handy for you if you would like to
+use those operators to execute your callable Python code.
+
+It is far more involved - you need to understand how Docker/Kubernetes Pods work if you want to use
+this approach, but the tasks are fully isolated from each other and you are not even limited to running
+Python code. You can write your tasks in any Programming language you want. Also your dependencies are
+fully independent from Airflow ones (including the system level dependencies) so if your task require
+very different environment, this is the way to go. Those are ``@task.docker`` and ``@task.kubernetes``
+decorators.
+
+The benefits of those operators are:
+
+* You can run tasks with different sets of both Python and system level dependencies, or even tasks
+  written in completely different language or even different processor architecture (x86 vs. arm).
+* The environment used to run the tasks enjoys the optimizations and immutability of containers, where
+  similar set of dependencies can effectively reuse a number of cached layers of the image, so the
+  environment is optimized for the case where you have multiple similar, but different environments.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Complete isolation between tasks. They cannot influence one another in other ways than using standard
+  Airflow XCom mechanisms.
+
+The drawbacks:
+
+* There is an overhead to start the tasks. Usually not as big as when creating virtual environments dynamically,
+  but still significant (especially for Kubernetes Pod Operator).
+* Resource re-use is still OK but a little less fine grained than in case of running task via virtual environment.
+  There is an overhead that each running container and Pod introduce, depending on your deployment, but it is
+  generally higher than when running virtual environment task. Also, there is somewhat duplication of resources used.
+  In case of both Docker and Kubernetes operator, running tasks requires always at least two processes - one
+  process (running in Docker Container or Kubernetes Pod) executing the task, and supervising Airflow
+  Python task that submits the job to Docker/Kubernetes and monitors it's execution.
+* Your environment needs to have the container images prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be much longer. Appropriate deployment pipeline here
+  is a must to be able to reliably maintain your deployment.
+* Your python callable has to be serializable if you want to run it via decorators, also in this case
+  all dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* You need to understand more details about how Docker Containers or Kubernetes work. The abstraction
+  provided by those two are "leaky", so you need to understand a bit more about resources, networking,
+  containers etc. in order to Author DAG that uses those operators.
+
+
+Using multiple Docker Images and Celery Queues
+----------------------------------------------
+
+There is a possibility (though it requires a deep knowledge of Airflow deployment) to run Airflow tasks
+using multiple, independent Docker images. This can be achieved via allocating different tasks to different
+Queues and configuring your Celery workers to use different images for different Queues. This however
+(at least currently) requires a lot of manual deployment configuration and intrinsic knowledge of how
+Airflow, Celery and Kubernetes works. Also it introduce quite some overhead for running the tasks - there
+are less chances for resource reuse and it's much more difficult to fine-tune such a deployment for
+cost of resources without impacting the performance and stability.
+
+One of the possible ways to make it more useful is
+`AIP-46 Runtime isolation for Airflow tasks and DAG parsing <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-46+Runtime+isolation+for+airflow+tasks+and+dag+parsing>`_.
+and completion of `AIP-43 DAG Processor Separation <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-43+DAG+Processor+separation>`_
+Until those are implemented, there are very little benefits of using this approach and it is not recommended.
+
+When those AIPs are implemented, however, this will open up the possibility of more multi-tenant approach,

Review Comment:
   ```suggestion
   When those AIPs are implemented, however, this will open up the possibility of a more multi-tenant approach,
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installation those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and ``PreexistingPythonVirtualenvOperator``
+as counterparts - as DAG author you'd normally iterate with dependencies and develop your DAG using
+``PythonVirtualenvOperator`` (thus decorating your tasks with ``@task.virtualenv`` decorators, while
+after the iteration and changes you would likely want to change it for production to switch to
+the ``PreexistingPythonVirtualenvOperator`` after your DevOps/System Admin teams deploy your new
+virtualenv to production. The nice thing about this is that you can switch the decorator back
+at any time and continue developing it "dynamically" with ``PythonVirtualenvOperator``.
+
+
+Using DockerOperator or Kubernetes Pod Operator
+-----------------------------------------------
+
+Another strategy is to use Docker Operator or Kubernetes Pod Operator. Those require that Airflow runs in
+Docker container environment or Kubernetes environment (or at the very least have access to create and
+run tasks with those.
+
+Similarly as in case of Python operators, the taskflow decorators are handy for you if you would like to
+use those operators to execute your callable Python code.
+
+It is far more involved - you need to understand how Docker/Kubernetes Pods work if you want to use
+this approach, but the tasks are fully isolated from each other and you are not even limited to running
+Python code. You can write your tasks in any Programming language you want. Also your dependencies are
+fully independent from Airflow ones (including the system level dependencies) so if your task require
+very different environment, this is the way to go. Those are ``@task.docker`` and ``@task.kubernetes``
+decorators.
+
+The benefits of those operators are:
+
+* You can run tasks with different sets of both Python and system level dependencies, or even tasks
+  written in completely different language or even different processor architecture (x86 vs. arm).
+* The environment used to run the tasks enjoys the optimizations and immutability of containers, where
+  similar set of dependencies can effectively reuse a number of cached layers of the image, so the
+  environment is optimized for the case where you have multiple similar, but different environments.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Complete isolation between tasks. They cannot influence one another in other ways than using standard
+  Airflow XCom mechanisms.
+
+The drawbacks:
+
+* There is an overhead to start the tasks. Usually not as big as when creating virtual environments dynamically,
+  but still significant (especially for Kubernetes Pod Operator).
+* Resource re-use is still OK but a little less fine grained than in case of running task via virtual environment.
+  There is an overhead that each running container and Pod introduce, depending on your deployment, but it is
+  generally higher than when running virtual environment task. Also, there is somewhat duplication of resources used.
+  In case of both Docker and Kubernetes operator, running tasks requires always at least two processes - one
+  process (running in Docker Container or Kubernetes Pod) executing the task, and supervising Airflow
+  Python task that submits the job to Docker/Kubernetes and monitors it's execution.
+* Your environment needs to have the container images prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be much longer. Appropriate deployment pipeline here

Review Comment:
   ```suggestion
     and iteration time when you work on new versions might be much longer. An appropriate deployment pipeline here
   ```
   
   Might want to expand on what is "appropriate' here.



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installation those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and ``PreexistingPythonVirtualenvOperator``
+as counterparts - as DAG author you'd normally iterate with dependencies and develop your DAG using
+``PythonVirtualenvOperator`` (thus decorating your tasks with ``@task.virtualenv`` decorators, while
+after the iteration and changes you would likely want to change it for production to switch to
+the ``PreexistingPythonVirtualenvOperator`` after your DevOps/System Admin teams deploy your new
+virtualenv to production. The nice thing about this is that you can switch the decorator back
+at any time and continue developing it "dynamically" with ``PythonVirtualenvOperator``.
+
+
+Using DockerOperator or Kubernetes Pod Operator
+-----------------------------------------------
+
+Another strategy is to use Docker Operator or Kubernetes Pod Operator. Those require that Airflow runs in
+Docker container environment or Kubernetes environment (or at the very least have access to create and
+run tasks with those.
+
+Similarly as in case of Python operators, the taskflow decorators are handy for you if you would like to
+use those operators to execute your callable Python code.
+
+It is far more involved - you need to understand how Docker/Kubernetes Pods work if you want to use
+this approach, but the tasks are fully isolated from each other and you are not even limited to running
+Python code. You can write your tasks in any Programming language you want. Also your dependencies are
+fully independent from Airflow ones (including the system level dependencies) so if your task require
+very different environment, this is the way to go. Those are ``@task.docker`` and ``@task.kubernetes``
+decorators.
+
+The benefits of those operators are:
+
+* You can run tasks with different sets of both Python and system level dependencies, or even tasks
+  written in completely different language or even different processor architecture (x86 vs. arm).
+* The environment used to run the tasks enjoys the optimizations and immutability of containers, where
+  similar set of dependencies can effectively reuse a number of cached layers of the image, so the
+  environment is optimized for the case where you have multiple similar, but different environments.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Complete isolation between tasks. They cannot influence one another in other ways than using standard
+  Airflow XCom mechanisms.
+
+The drawbacks:
+
+* There is an overhead to start the tasks. Usually not as big as when creating virtual environments dynamically,
+  but still significant (especially for Kubernetes Pod Operator).
+* Resource re-use is still OK but a little less fine grained than in case of running task via virtual environment.
+  There is an overhead that each running container and Pod introduce, depending on your deployment, but it is
+  generally higher than when running virtual environment task. Also, there is somewhat duplication of resources used.
+  In case of both Docker and Kubernetes operator, running tasks requires always at least two processes - one
+  process (running in Docker Container or Kubernetes Pod) executing the task, and supervising Airflow
+  Python task that submits the job to Docker/Kubernetes and monitors it's execution.
+* Your environment needs to have the container images prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be much longer. Appropriate deployment pipeline here
+  is a must to be able to reliably maintain your deployment.
+* Your python callable has to be serializable if you want to run it via decorators, also in this case
+  all dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* You need to understand more details about how Docker Containers or Kubernetes work. The abstraction
+  provided by those two are "leaky", so you need to understand a bit more about resources, networking,
+  containers etc. in order to Author DAG that uses those operators.
+
+
+Using multiple Docker Images and Celery Queues
+----------------------------------------------
+
+There is a possibility (though it requires a deep knowledge of Airflow deployment) to run Airflow tasks
+using multiple, independent Docker images. This can be achieved via allocating different tasks to different
+Queues and configuring your Celery workers to use different images for different Queues. This however
+(at least currently) requires a lot of manual deployment configuration and intrinsic knowledge of how
+Airflow, Celery and Kubernetes works. Also it introduce quite some overhead for running the tasks - there

Review Comment:
   ```suggestion
   Airflow, Celery and Kubernetes works. Also it introduces quite some overhead for running the tasks - there
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installation those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and ``PreexistingPythonVirtualenvOperator``
+as counterparts - as DAG author you'd normally iterate with dependencies and develop your DAG using
+``PythonVirtualenvOperator`` (thus decorating your tasks with ``@task.virtualenv`` decorators, while

Review Comment:
   ```suggestion
   ``PythonVirtualenvOperator`` (thus decorating your tasks with ``@task.virtualenv`` decorators) while
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installation those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and ``PreexistingPythonVirtualenvOperator``
+as counterparts - as DAG author you'd normally iterate with dependencies and develop your DAG using
+``PythonVirtualenvOperator`` (thus decorating your tasks with ``@task.virtualenv`` decorators, while
+after the iteration and changes you would likely want to change it for production to switch to
+the ``PreexistingPythonVirtualenvOperator`` after your DevOps/System Admin teams deploy your new
+virtualenv to production. The nice thing about this is that you can switch the decorator back
+at any time and continue developing it "dynamically" with ``PythonVirtualenvOperator``.
+
+
+Using DockerOperator or Kubernetes Pod Operator
+-----------------------------------------------
+
+Another strategy is to use Docker Operator or Kubernetes Pod Operator. Those require that Airflow runs in
+Docker container environment or Kubernetes environment (or at the very least have access to create and
+run tasks with those.
+
+Similarly as in case of Python operators, the taskflow decorators are handy for you if you would like to
+use those operators to execute your callable Python code.
+
+It is far more involved - you need to understand how Docker/Kubernetes Pods work if you want to use
+this approach, but the tasks are fully isolated from each other and you are not even limited to running
+Python code. You can write your tasks in any Programming language you want. Also your dependencies are
+fully independent from Airflow ones (including the system level dependencies) so if your task require
+very different environment, this is the way to go. Those are ``@task.docker`` and ``@task.kubernetes``
+decorators.
+
+The benefits of those operators are:
+
+* You can run tasks with different sets of both Python and system level dependencies, or even tasks
+  written in completely different language or even different processor architecture (x86 vs. arm).
+* The environment used to run the tasks enjoys the optimizations and immutability of containers, where
+  similar set of dependencies can effectively reuse a number of cached layers of the image, so the
+  environment is optimized for the case where you have multiple similar, but different environments.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Complete isolation between tasks. They cannot influence one another in other ways than using standard
+  Airflow XCom mechanisms.
+
+The drawbacks:
+
+* There is an overhead to start the tasks. Usually not as big as when creating virtual environments dynamically,
+  but still significant (especially for Kubernetes Pod Operator).
+* Resource re-use is still OK but a little less fine grained than in case of running task via virtual environment.
+  There is an overhead that each running container and Pod introduce, depending on your deployment, but it is
+  generally higher than when running virtual environment task. Also, there is somewhat duplication of resources used.
+  In case of both Docker and Kubernetes operator, running tasks requires always at least two processes - one
+  process (running in Docker Container or Kubernetes Pod) executing the task, and supervising Airflow
+  Python task that submits the job to Docker/Kubernetes and monitors it's execution.
+* Your environment needs to have the container images prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment

Review Comment:
   ```suggestion
     cannot change it on the fly, adding new or changing requirements requires at least an Airflow re-deployment
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installation those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and ``PreexistingPythonVirtualenvOperator``
+as counterparts - as DAG author you'd normally iterate with dependencies and develop your DAG using
+``PythonVirtualenvOperator`` (thus decorating your tasks with ``@task.virtualenv`` decorators, while
+after the iteration and changes you would likely want to change it for production to switch to
+the ``PreexistingPythonVirtualenvOperator`` after your DevOps/System Admin teams deploy your new
+virtualenv to production. The nice thing about this is that you can switch the decorator back
+at any time and continue developing it "dynamically" with ``PythonVirtualenvOperator``.
+
+
+Using DockerOperator or Kubernetes Pod Operator
+-----------------------------------------------
+
+Another strategy is to use Docker Operator or Kubernetes Pod Operator. Those require that Airflow runs in
+Docker container environment or Kubernetes environment (or at the very least have access to create and
+run tasks with those.
+
+Similarly as in case of Python operators, the taskflow decorators are handy for you if you would like to
+use those operators to execute your callable Python code.
+
+It is far more involved - you need to understand how Docker/Kubernetes Pods work if you want to use
+this approach, but the tasks are fully isolated from each other and you are not even limited to running
+Python code. You can write your tasks in any Programming language you want. Also your dependencies are
+fully independent from Airflow ones (including the system level dependencies) so if your task require
+very different environment, this is the way to go. Those are ``@task.docker`` and ``@task.kubernetes``
+decorators.
+
+The benefits of those operators are:
+
+* You can run tasks with different sets of both Python and system level dependencies, or even tasks
+  written in completely different language or even different processor architecture (x86 vs. arm).
+* The environment used to run the tasks enjoys the optimizations and immutability of containers, where
+  similar set of dependencies can effectively reuse a number of cached layers of the image, so the
+  environment is optimized for the case where you have multiple similar, but different environments.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Complete isolation between tasks. They cannot influence one another in other ways than using standard
+  Airflow XCom mechanisms.
+
+The drawbacks:
+
+* There is an overhead to start the tasks. Usually not as big as when creating virtual environments dynamically,
+  but still significant (especially for Kubernetes Pod Operator).
+* Resource re-use is still OK but a little less fine grained than in case of running task via virtual environment.
+  There is an overhead that each running container and Pod introduce, depending on your deployment, but it is
+  generally higher than when running virtual environment task. Also, there is somewhat duplication of resources used.
+  In case of both Docker and Kubernetes operator, running tasks requires always at least two processes - one

Review Comment:
   ```suggestion
     In case of both Docker and Kubernetes operator, running tasks requires at least two processes - one
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installation those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and ``PreexistingPythonVirtualenvOperator``
+as counterparts - as DAG author you'd normally iterate with dependencies and develop your DAG using
+``PythonVirtualenvOperator`` (thus decorating your tasks with ``@task.virtualenv`` decorators, while
+after the iteration and changes you would likely want to change it for production to switch to
+the ``PreexistingPythonVirtualenvOperator`` after your DevOps/System Admin teams deploy your new
+virtualenv to production. The nice thing about this is that you can switch the decorator back
+at any time and continue developing it "dynamically" with ``PythonVirtualenvOperator``.
+
+
+Using DockerOperator or Kubernetes Pod Operator
+-----------------------------------------------
+
+Another strategy is to use Docker Operator or Kubernetes Pod Operator. Those require that Airflow runs in
+Docker container environment or Kubernetes environment (or at the very least have access to create and
+run tasks with those.
+
+Similarly as in case of Python operators, the taskflow decorators are handy for you if you would like to
+use those operators to execute your callable Python code.
+
+It is far more involved - you need to understand how Docker/Kubernetes Pods work if you want to use
+this approach, but the tasks are fully isolated from each other and you are not even limited to running
+Python code. You can write your tasks in any Programming language you want. Also your dependencies are
+fully independent from Airflow ones (including the system level dependencies) so if your task require
+very different environment, this is the way to go. Those are ``@task.docker`` and ``@task.kubernetes``
+decorators.
+
+The benefits of those operators are:
+
+* You can run tasks with different sets of both Python and system level dependencies, or even tasks
+  written in completely different language or even different processor architecture (x86 vs. arm).
+* The environment used to run the tasks enjoys the optimizations and immutability of containers, where
+  similar set of dependencies can effectively reuse a number of cached layers of the image, so the
+  environment is optimized for the case where you have multiple similar, but different environments.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Complete isolation between tasks. They cannot influence one another in other ways than using standard
+  Airflow XCom mechanisms.
+
+The drawbacks:
+
+* There is an overhead to start the tasks. Usually not as big as when creating virtual environments dynamically,
+  but still significant (especially for Kubernetes Pod Operator).
+* Resource re-use is still OK but a little less fine grained than in case of running task via virtual environment.
+  There is an overhead that each running container and Pod introduce, depending on your deployment, but it is
+  generally higher than when running virtual environment task. Also, there is somewhat duplication of resources used.
+  In case of both Docker and Kubernetes operator, running tasks requires always at least two processes - one
+  process (running in Docker Container or Kubernetes Pod) executing the task, and supervising Airflow

Review Comment:
   ```suggestion
     process (running in Docker Container or Kubernetes Pod) executing the task, and another process supervising the Airflow
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installation those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and ``PreexistingPythonVirtualenvOperator``
+as counterparts - as DAG author you'd normally iterate with dependencies and develop your DAG using
+``PythonVirtualenvOperator`` (thus decorating your tasks with ``@task.virtualenv`` decorators, while
+after the iteration and changes you would likely want to change it for production to switch to
+the ``PreexistingPythonVirtualenvOperator`` after your DevOps/System Admin teams deploy your new
+virtualenv to production. The nice thing about this is that you can switch the decorator back
+at any time and continue developing it "dynamically" with ``PythonVirtualenvOperator``.
+
+
+Using DockerOperator or Kubernetes Pod Operator
+-----------------------------------------------
+
+Another strategy is to use Docker Operator or Kubernetes Pod Operator. Those require that Airflow runs in
+Docker container environment or Kubernetes environment (or at the very least have access to create and
+run tasks with those.
+
+Similarly as in case of Python operators, the taskflow decorators are handy for you if you would like to
+use those operators to execute your callable Python code.
+
+It is far more involved - you need to understand how Docker/Kubernetes Pods work if you want to use
+this approach, but the tasks are fully isolated from each other and you are not even limited to running
+Python code. You can write your tasks in any Programming language you want. Also your dependencies are
+fully independent from Airflow ones (including the system level dependencies) so if your task require
+very different environment, this is the way to go. Those are ``@task.docker`` and ``@task.kubernetes``
+decorators.
+
+The benefits of those operators are:
+
+* You can run tasks with different sets of both Python and system level dependencies, or even tasks
+  written in completely different language or even different processor architecture (x86 vs. arm).
+* The environment used to run the tasks enjoys the optimizations and immutability of containers, where
+  similar set of dependencies can effectively reuse a number of cached layers of the image, so the
+  environment is optimized for the case where you have multiple similar, but different environments.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Complete isolation between tasks. They cannot influence one another in other ways than using standard
+  Airflow XCom mechanisms.
+
+The drawbacks:
+
+* There is an overhead to start the tasks. Usually not as big as when creating virtual environments dynamically,
+  but still significant (especially for Kubernetes Pod Operator).
+* Resource re-use is still OK but a little less fine grained than in case of running task via virtual environment.
+  There is an overhead that each running container and Pod introduce, depending on your deployment, but it is
+  generally higher than when running virtual environment task. Also, there is somewhat duplication of resources used.
+  In case of both Docker and Kubernetes operator, running tasks requires always at least two processes - one
+  process (running in Docker Container or Kubernetes Pod) executing the task, and supervising Airflow
+  Python task that submits the job to Docker/Kubernetes and monitors it's execution.
+* Your environment needs to have the container images prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be much longer. Appropriate deployment pipeline here
+  is a must to be able to reliably maintain your deployment.
+* Your python callable has to be serializable if you want to run it via decorators, also in this case
+  all dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* You need to understand more details about how Docker Containers or Kubernetes work. The abstraction
+  provided by those two are "leaky", so you need to understand a bit more about resources, networking,
+  containers etc. in order to Author DAG that uses those operators.
+
+
+Using multiple Docker Images and Celery Queues
+----------------------------------------------
+
+There is a possibility (though it requires a deep knowledge of Airflow deployment) to run Airflow tasks
+using multiple, independent Docker images. This can be achieved via allocating different tasks to different
+Queues and configuring your Celery workers to use different images for different Queues. This however
+(at least currently) requires a lot of manual deployment configuration and intrinsic knowledge of how
+Airflow, Celery and Kubernetes works. Also it introduce quite some overhead for running the tasks - there
+are less chances for resource reuse and it's much more difficult to fine-tune such a deployment for
+cost of resources without impacting the performance and stability.
+
+One of the possible ways to make it more useful is
+`AIP-46 Runtime isolation for Airflow tasks and DAG parsing <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-46+Runtime+isolation+for+airflow+tasks+and+dag+parsing>`_.
+and completion of `AIP-43 DAG Processor Separation <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-43+DAG+Processor+separation>`_
+Until those are implemented, there are very little benefits of using this approach and it is not recommended.
+
+When those AIPs are implemented, however, this will open up the possibility of more multi-tenant approach,
+where multiple teams will be able to have completely isolated set of dependencies that will be used across

Review Comment:
   ```suggestion
   where multiple teams will be able to have completely isolated sets of dependencies that will be used across
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1236728942

   Looking forward to merging that one, if there are no more comments. @ashb, you review is blocking here, and I believe the name change is addressed already.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] ashb commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
ashb commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r962836873


##########
tests/decorators/test_external_python.py:
##########
@@ -0,0 +1,101 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+import datetime
+import sys
+from datetime import timedelta
+from subprocess import CalledProcessError
+
+import pytest
+
+from airflow.decorators import task
+from airflow.utils import timezone
+
+DEFAULT_DATE = timezone.datetime(2016, 1, 1)
+END_DATE = timezone.datetime(2016, 1, 2)
+INTERVAL = timedelta(hours=12)
+FROZEN_NOW = timezone.datetime(2016, 1, 2, 12, 1, 1)
+
+TI_CONTEXT_ENV_VARS = [
+    'AIRFLOW_CTX_DAG_ID',
+    'AIRFLOW_CTX_TASK_ID',
+    'AIRFLOW_CTX_EXECUTION_DATE',
+    'AIRFLOW_CTX_DAG_RUN_ID',
+]
+
+
+PYTHON_VERSION = sys.version_info[0]
+
+# Technically Not a separate virtualenv but should be good enough for unit tests
+PYTHON = sys.executable
+
+
+class TestExternalPythonDecorator:
+    def test_add_dill(self, dag_maker):
+        @task.external_python(python=PYTHON, use_dill=True)
+        def f():
+            """Ensure dill is correctly installed."""
+            import dill  # noqa: F401

Review Comment:
   Does this test actually cover anything? For creating a venv I can see this makes sense but  not sure about for this operator. (At least the comment here is incorrect anyway)



##########
airflow/operators/python.py:
##########
@@ -501,27 +555,152 @@ def _iter_serializable_context_keys(self):
         elif 'pendulum' in self.requirements:
             yield from self.PENDULUM_SERIALIZABLE_CONTEXT_KEYS
 
-    def _write_string_args(self, filename):
-        with open(filename, 'w') as file:
-            file.write('\n'.join(map(str, self.string_args)))
 
-    def _read_result(self, filename):
-        if os.stat(filename).st_size == 0:
-            return None
-        with open(filename, 'rb') as file:
-            try:
-                return self.pickling_library.load(file)
-            except ValueError:
-                self.log.error(
-                    "Error deserializing result. Note that result deserialization "
-                    "is not supported across major Python versions."
+class ExternalPythonOperator(_BasePythonVirtualenvOperator):
+    """
+    Allows one to run a function in a virtualenv that is not re-created but used as is
+    without the overhead of creating the virtualenv (with certain caveats).
+
+    The function must be defined using def, and not be
+    part of a class. All imports must happen inside the function
+    and no variables outside the scope may be referenced. A global scope
+    variable named virtualenv_string_args will be available (populated by
+    string_args). In addition, one can pass stuff through op_args and op_kwargs, and one
+    can use a return value.
+    Note that if your virtualenv runs in a different Python major version than Airflow,
+    you cannot use return values, op_args, op_kwargs, or use any macros that are being provided to
+    Airflow through plugins. You can use string_args though.
+
+    .. seealso::
+        For more information on how to use this operator, take a look at the guide:
+        :ref:`howto/operator:ExternalPythonOperator`
+
+    :param python: Full path string (file-system specific) that points to a Python binary inside
+        a virtualenv that should be used (in ``VENV/bin`` folder). Should be absolute path
+        (so usually start with "/" or "X:/" depending on the filesystem/os used).
+    :param python_callable: A python function with no references to outside variables,
+        defined with def, which will be run in a virtualenv
+    :param use_dill: Whether to use dill to serialize
+        the args and result (pickle is default). This allow more complex types
+        but if dill is not preinstalled in your venv, the task will fail with use_dill enabled.
+    :param op_args: A list of positional arguments to pass to python_callable.
+    :param op_kwargs: A dict of keyword arguments to pass to python_callable.
+    :param string_args: Strings that are present in the global var virtualenv_string_args,
+        available to python_callable at runtime as a list[str]. Note that args are split
+        by newline.
+    :param templates_dict: a dictionary where the values are templates that
+        will get templated by the Airflow engine sometime between
+        ``__init__`` and ``execute`` takes place and are made available
+        in your callable's context after the template has been applied
+    :param templates_exts: a list of file extensions to resolve while
+        processing templated fields, for examples ``['.sql', '.hql']``
+    """
+
+    template_fields: Sequence[str] = tuple({'python_path'} | set(PythonOperator.template_fields))
+
+    def __init__(
+        self,
+        *,
+        python: str,
+        python_callable: Callable,
+        use_dill: bool = False,
+        op_args: Optional[Collection[Any]] = None,
+        op_kwargs: Optional[Mapping[str, Any]] = None,
+        string_args: Optional[Iterable[str]] = None,
+        templates_dict: Optional[Dict] = None,
+        templates_exts: Optional[List[str]] = None,
+        **kwargs,
+    ):
+        if not python:
+            raise ValueError("Python Path must be defined in ExternalPythonOperator")
+        self.python = python
+        super().__init__(
+            python_callable=python_callable,
+            use_dill=use_dill,
+            op_args=op_args,
+            op_kwargs=op_kwargs,
+            string_args=string_args,
+            templates_dict=templates_dict,
+            templates_exts=templates_exts,
+            **kwargs,
+        )
+
+    def execute_callable(self):
+        python_path = Path(self.python)
+        if not python_path.exists():
+            raise ValueError(f"Python Path '{python_path}' must exists")
+        if not python_path.is_file():
+            raise ValueError(f"Python Path '{python_path}' must be a file")
+        if not python_path.is_absolute():
+            raise ValueError(f"Python Path '{python_path}' must be an absolute path.")
+        python_version_as_list_of_strings = self._get_python_version_from_venv()
+        if (
+            python_version_as_list_of_strings
+            and str(python_version_as_list_of_strings[0]) != str(sys.version_info.major)
+            and (self.op_args or self.op_kwargs)
+        ):
+            raise AirflowException(
+                "Passing op_args or op_kwargs is not supported across different Python "
+                "major versions for ExternalPythonOperator. Please use string_args."
+                f"Sys version: {sys.version_info}. Venv version: {python_version_as_list_of_strings}"
+            )
+        with TemporaryDirectory(prefix='tmd') as tmp_dir:
+            tmp_path = Path(tmp_dir)
+            return self._execute_python_callable_in_subprocess(python_path, tmp_path)
+
+    def _get_virtualenv_path(self) -> Path:
+        return Path(self.python).parents[1]
+
+    def _get_python_version_from_venv(self) -> List[str]:
+        try:
+            result = subprocess.check_output([self.python, "--version"], text=True)
+            return result.strip().split(" ")[-1].split(".")
+        except Exception as e:
+            raise ValueError(f"Error while executing {self.python}: {e}")
+
+    def _get_airflow_version_from_venv(self) -> Optional[str]:
+        try:
+            result = subprocess.check_output(
+                [self.python, "-c", "from airflow import version; print(version.version)"], text=True
+            )
+            venv_airflow_version = result.strip()
+            if venv_airflow_version != airflow_version:
+                raise AirflowConfigException(
+                    f"The version of Airflow installed in the virtualenv {self._get_virtualenv_path()}: "
+                    f"{venv_airflow_version} is different than the runtime Airflow version: "
+                    f"{airflow_version}. Make sure your environment has the same Airflow version "
+                    f"installed as the Airflow runtime."
                 )
-                raise
+            return venv_airflow_version
+        except Exception as e:
+            self.log.info("When checking for Airflow installed in venv got %s", e)
+            self.log.info(
+                f"This means that Airflow is not properly installed in the virtualenv "
+                f"{self._get_virtualenv_path()}. Airflow context keys will not be available. "
+                f"Please Install Airflow {airflow_version} in your venv to access them."
+            )

Review Comment:
   One thought here: Part of the primary reason for using an external Python is to be able to run a task which has conflicts with Core Airflow (for instance some dbt-core libs are tricky to import with airflow right now I think) so is it worth being able to silence this warning somehow?



##########
airflow/operators/python.py:
##########
@@ -501,27 +555,152 @@ def _iter_serializable_context_keys(self):
         elif 'pendulum' in self.requirements:
             yield from self.PENDULUM_SERIALIZABLE_CONTEXT_KEYS
 
-    def _write_string_args(self, filename):
-        with open(filename, 'w') as file:
-            file.write('\n'.join(map(str, self.string_args)))
 
-    def _read_result(self, filename):
-        if os.stat(filename).st_size == 0:
-            return None
-        with open(filename, 'rb') as file:
-            try:
-                return self.pickling_library.load(file)
-            except ValueError:
-                self.log.error(
-                    "Error deserializing result. Note that result deserialization "
-                    "is not supported across major Python versions."
+class ExternalPythonOperator(_BasePythonVirtualenvOperator):
+    """
+    Allows one to run a function in a virtualenv that is not re-created but used as is
+    without the overhead of creating the virtualenv (with certain caveats).
+
+    The function must be defined using def, and not be
+    part of a class. All imports must happen inside the function
+    and no variables outside the scope may be referenced. A global scope
+    variable named virtualenv_string_args will be available (populated by
+    string_args). In addition, one can pass stuff through op_args and op_kwargs, and one
+    can use a return value.
+    Note that if your virtualenv runs in a different Python major version than Airflow,
+    you cannot use return values, op_args, op_kwargs, or use any macros that are being provided to
+    Airflow through plugins. You can use string_args though.
+
+    .. seealso::
+        For more information on how to use this operator, take a look at the guide:
+        :ref:`howto/operator:ExternalPythonOperator`
+
+    :param python: Full path string (file-system specific) that points to a Python binary inside
+        a virtualenv that should be used (in ``VENV/bin`` folder). Should be absolute path
+        (so usually start with "/" or "X:/" depending on the filesystem/os used).
+    :param python_callable: A python function with no references to outside variables,
+        defined with def, which will be run in a virtualenv
+    :param use_dill: Whether to use dill to serialize
+        the args and result (pickle is default). This allow more complex types
+        but if dill is not preinstalled in your venv, the task will fail with use_dill enabled.
+    :param op_args: A list of positional arguments to pass to python_callable.
+    :param op_kwargs: A dict of keyword arguments to pass to python_callable.
+    :param string_args: Strings that are present in the global var virtualenv_string_args,
+        available to python_callable at runtime as a list[str]. Note that args are split
+        by newline.

Review Comment:
   This whole string_args thing was always a bit odd/confusing to me. I wonder if we should not add it to this new operator?



##########
airflow/example_dags/example_python_operator.py:
##########
@@ -32,6 +35,14 @@
 
 log = logging.getLogger(__name__)
 
+PYTHON = sys.executable
+
+BASE_DIR = tempfile.gettempdir()
+EXTERNAL_PYTHON_ENV = Path(BASE_DIR, "venv-for-system-tests")

Review Comment:
   Nit(ish): The parameterization of this probably should be in tests/dags, not `airflow/example_dags`
   
   Reason: example_dags often show up for new users, and seeing this env var here would be a bit confusing to them
   
   Edit: Is this even used anywhere?



##########
tests/decorators/test_external_python.py:
##########
@@ -0,0 +1,101 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+import datetime
+import sys
+from datetime import timedelta
+from subprocess import CalledProcessError
+
+import pytest
+
+from airflow.decorators import task
+from airflow.utils import timezone
+
+DEFAULT_DATE = timezone.datetime(2016, 1, 1)
+END_DATE = timezone.datetime(2016, 1, 2)
+INTERVAL = timedelta(hours=12)
+FROZEN_NOW = timezone.datetime(2016, 1, 2, 12, 1, 1)
+
+TI_CONTEXT_ENV_VARS = [
+    'AIRFLOW_CTX_DAG_ID',
+    'AIRFLOW_CTX_TASK_ID',
+    'AIRFLOW_CTX_EXECUTION_DATE',
+    'AIRFLOW_CTX_DAG_RUN_ID',
+]
+
+
+PYTHON_VERSION = sys.version_info[0]
+
+# Technically Not a separate virtualenv but should be good enough for unit tests
+PYTHON = sys.executable
+
+
+class TestExternalPythonDecorator:
+    def test_add_dill(self, dag_maker):
+        @task.external_python(python=PYTHON, use_dill=True)
+        def f():
+            """Ensure dill is correctly installed."""
+            import dill  # noqa: F401
+
+        with dag_maker():
+            ret = f()
+
+        ret.operator.run(start_date=DEFAULT_DATE, end_date=DEFAULT_DATE)
+
+    def test_fail(self, dag_maker):
+        @task.external_python(python=PYTHON)
+        def f():
+            raise Exception
+
+        with dag_maker():
+            ret = f()
+
+        with pytest.raises(CalledProcessError):
+            ret.operator.run(start_date=DEFAULT_DATE, end_date=DEFAULT_DATE)
+
+    def test_with_args(self, dag_maker):
+        @task.external_python(python=PYTHON)
+        def f(a, b, c=False, d=False):
+            if a == 0 and b == 1 and c and not d:
+                return True
+            else:
+                raise Exception
+
+        with dag_maker():
+            ret = f(0, 1, c=True)
+
+        ret.operator.run(start_date=DEFAULT_DATE, end_date=DEFAULT_DATE)
+
+    def test_return_none(self, dag_maker):
+        @task.external_python(python=PYTHON)
+        def f():
+            return None
+
+        with dag_maker():
+            ret = f()
+
+        ret.operator.run(start_date=DEFAULT_DATE, end_date=DEFAULT_DATE)
+
+    def test_nonimported_as_arg(self, dag_maker):

Review Comment:
   Don't this case is needed either.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1237323124

   Added all changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1234406082

   > thanks for your answer , it's really clear +1
   
   BTW. Funny thing - the customer uses Nomad not K8S, and CeleryExecutor, but it would not change a thing for them if they did use K8S.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r950573854


##########
airflow/providers/docker/decorators/docker.py:
##########
@@ -27,7 +27,8 @@
 
 from airflow.decorators.base import DecoratedOperator, task_decorator_factory
 from airflow.providers.docker.operators.docker import DockerOperator
-from airflow.utils.python_virtualenv import remove_task_decorator, write_python_script
+from airflow.utils.decorators import remove_task_decorator

Review Comment:
   This is a bit easier than try/import dance I think,



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r948833559


##########
airflow/decorators/__init__.pyi:
##########
@@ -124,6 +126,40 @@ class TaskDecoratorCollection:
     @overload
     def virtualenv(self, python_callable: Callable[FParams, FReturn]) -> Task[FParams, FReturn]: ...
     @overload
+    def external_python(
+        self,
+        *,
+        python_fspath: str = None,

Review Comment:
   Yeah. I wanted to use `python` here but thought it was a bit `bare` but if it feels natural for you as well, I will change it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] ashb commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
ashb commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r950191964


##########
airflow/providers/docker/decorators/docker.py:
##########
@@ -27,7 +27,8 @@
 
 from airflow.decorators.base import DecoratedOperator, task_decorator_factory
 from airflow.providers.docker.operators.docker import DockerOperator
-from airflow.utils.python_virtualenv import remove_task_decorator, write_python_script
+from airflow.utils.decorators import remove_task_decorator

Review Comment:
   This is going to need a bit of a dance to make it work with backcompat



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25780: Implement PythonOtherenvOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1229268718

   > That way , things stay simple and explicit at the same time
   
   This is where we started and we wanted ato change it, but I am also Ok with turning PythonVirtualenv into multipurpose one. That makes it even closer to PythonVirtualenvOperator (and I think it should be) but it's even more "virtulaenv" than than before and @uranusjr and @ashb have apparently problem that this operator can use Python which is not in a venv. But I think it's like 9X% uses of the operator, so I am also OK with merging it back into PythonVirtualenvOperator. This is close to the docs I created where really those two only differ by the fact whether they are created at runtime or earlier, but all the other parameters remain the same.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r951853331


##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your

Review Comment:
   This is mostly result of earlier question at the devlist "should we allow the user to also add extra requirements similar as in PythonVirtualenvOperator? 
   
   My answer is "no". The env should be immutable - if we allow to mix "preexisting" but "dynamic" virtualenvs, this opens up a host of edge cases. This is why immutable is mentioned. We might want to add more info about it if you think it is unclear.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r963011372


##########
airflow/example_dags/example_python_operator.py:
##########
@@ -32,6 +35,14 @@
 
 log = logging.getLogger(__name__)
 
+PYTHON = sys.executable
+
+BASE_DIR = tempfile.gettempdir()
+EXTERNAL_PYTHON_ENV = Path(BASE_DIR, "venv-for-system-tests")

Review Comment:
   Removed now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r963058505


##########
tests/decorators/test_external_python.py:
##########
@@ -0,0 +1,101 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+import datetime
+import sys
+from datetime import timedelta
+from subprocess import CalledProcessError
+
+import pytest
+
+from airflow.decorators import task
+from airflow.utils import timezone
+
+DEFAULT_DATE = timezone.datetime(2016, 1, 1)
+END_DATE = timezone.datetime(2016, 1, 2)
+INTERVAL = timedelta(hours=12)
+FROZEN_NOW = timezone.datetime(2016, 1, 2, 12, 1, 1)
+
+TI_CONTEXT_ENV_VARS = [
+    'AIRFLOW_CTX_DAG_ID',
+    'AIRFLOW_CTX_TASK_ID',
+    'AIRFLOW_CTX_EXECUTION_DATE',
+    'AIRFLOW_CTX_DAG_RUN_ID',
+]
+
+
+PYTHON_VERSION = sys.version_info[0]
+
+# Technically Not a separate virtualenv but should be good enough for unit tests
+PYTHON = sys.executable
+
+
+class TestExternalPythonDecorator:
+    def test_add_dill(self, dag_maker):
+        @task.external_python(python=PYTHON, use_dill=True)
+        def f():
+            """Ensure dill is correctly installed."""
+            import dill  # noqa: F401

Review Comment:
   Oh yeah. I was too lazy. I did it properly now - and I converted the tests to use dynamically created venvs (and added no_dill/dill tests).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] uranusjr commented on pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
uranusjr commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1221461338

   The problem with `PythonPreexistingVirtualenvOperator` is it ideologically misleads the user to think it can only reference a virtual environment, while functionally it can work against much more (for example, you can set `python="/usr/bin/python"` to use system packages installed by apt, even if Airflow is not installed against that Python interpreter). This leads to worse discoverability.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r950509479


##########
airflow/providers/docker/decorators/docker.py:
##########
@@ -27,7 +27,8 @@
 
 from airflow.decorators.base import DecoratedOperator, task_decorator_factory
 from airflow.providers.docker.operators.docker import DockerOperator
-from airflow.utils.python_virtualenv import remove_task_decorator, write_python_script
+from airflow.utils.decorators import remove_task_decorator

Review Comment:
   I aactually imported it already in the old place :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1221384721

   I think this one is ready for review:
   
   * tests should be passing now
   * system Test is working (and uses external venv)
   * documentation is updated (howtos)
   * I even added Best Practices chapter about `Handling Python Dependencies` that provides more insights on how you can run your tasks using different sets of dependencies - it discusses pros/cons of different approaches


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1234236371

   > Yeah it will work , I'm just concerned about "encouraging" users to create `just one single image with multiple predefined envs` . Sometimes users create chaotic stack design because something work out of box :)
   
   I am not sure if we want to do it to be honest. I do not think we should encourage it at all (we should present it as an option and we do) because everyone's mileage is different. I spoke to a few users of Airlfow (my customers) and it really depends what stage and experience you have IMHO.
   
   1) some customers who do not have many "system" dependencies problems and are not "super" excited about infrastructure and build pipelinesf for it, will likely prefer single image with multiple venvs (actually the idea of this operator came precisely from that discussion - I met them yesterday and they cannot wait for 2.4 being available because it solves huge problem they have with multiple teams in a simple way.
   
   2) but there are more sophisticated customers who have multiple separate teams that have multiple complex requirements (including system dependencies).  There only multiple container images will cut it. Extreme case of it - think GPU and ARM support for one team crossed with having to install Python 2.7 on an old debian because this is the only environment all dependencies will be installable (as much as we would not like it - Python 2.7 is still poptular in gaming industry apparently - Unity added Python binding few years ago with Python 2.7 only https://docs.unity3d.com/Packages/com.unity.scripting.python@2.0/manual/installation.html  and it is still 2.7 only !!! :scream: ). This is of course extreme, but you get the idea.
   
   So rather than advising the user to choose one over the other, I chose a different route, similar to the "installation" page - - https://airflow.apache.org/docs/apache-airflow/stable/installation/index.html - if you look at the "best practices" chapter in my PR I simply describe all the options and explain pros and cons of each approach and consequences of doing so. This description is precisely targeted for the users who will attempt to ask us "which is the best approach". 
   
   Since we cannot answer this question authoritatively IMHO and we do not want to engage in long discussions with each user (this does not scale) to figure out which option is best for the particular user, we will simply send the user to that page, which they will be able to read and decide on their own. We simply cannot make the decisions for them, but we give them all the information in the way that they can make the decision themselves.
   
   I tried to make this "best practices" chapter to be unbiased, factual and very precisely describing pros and cons of each approach and they are grouped in one chapter progresslvely going from the simplest (PythonVirtualenv) to the most complex and involved (Kubernetes). 
   
   I think this is the best we can do.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1234690758

   Looking forward to get it merged :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r963010384


##########
airflow/example_dags/example_python_operator.py:
##########
@@ -32,6 +35,14 @@
 
 log = logging.getLogger(__name__)
 
+PYTHON = sys.executable
+
+BASE_DIR = tempfile.gettempdir()
+EXTERNAL_PYTHON_ENV = Path(BASE_DIR, "venv-for-system-tests")

Review Comment:
   Not any more. I removed it after @mik-laj comments and replaced it with sys.executable. But not completly it seems :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r951846006


##########
docs/apache-airflow/best-practices.rst:
##########
@@ -20,10 +20,11 @@
 Best Practices
 ==============
 
-Creating a new DAG is a two-step process:
+Creating a new DAG is a three-step process:
 
 - writing Python code to create a DAG object,
-- testing if the code meets our expectations
+- testing if the code meets our expectations,
+- running the DAG in production

Review Comment:
   ```suggestion
   - configuring environment dependencies to run your DAG
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1224369612

   Hey @uranusjr - WDYT? Did I manage to convince you ? (Any more review comments are most welcome too). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r956399143


##########
tests/utils/test_preexisting_python_virtualenv_decorator.py:
##########
@@ -0,0 +1,52 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+import unittest
+
+from airflow.utils.decorators import remove_task_decorator
+
+
+class TestPreexistingPythonVirtualenvDecorator(unittest.TestCase):
+    def test_remove_task_decorator(self):
+        py_source = "@task.preexisting_virtualenv(use_dill=True)\ndef f():\nimport funcsigs"
+        res = remove_task_decorator(
+            python_source=py_source, task_decorator_name="@task.preexisting_virtualenv"
+        )
+        assert res == "def f():\nimport funcsigs"
+
+    def test_remove_decorator_no_parens(self):
+
+        py_source = "@task.preexisting_virtualenv\ndef f():\nimport funcsigs"
+        res = remove_task_decorator(
+            python_source=py_source, task_decorator_name="@task.preexisting_virtualenv"
+        )
+        assert res == "def f():\nimport funcsigs"
+
+    def test_remove_decorator_nested(self):
+
+        py_source = "@foo\n@task.preexisting_virtualenv\n@bar\ndef f():\nimport funcsigs"
+        res = remove_task_decorator(
+            python_source=py_source, task_decorator_name="@task.preexisting_virtualenv"
+        )
+        assert res == "@foo\n@bar\ndef f():\nimport funcsigs"
+
+        py_source = "@foo\n@task.preexisting_virtualenv()\n@bar\ndef f():\nimport funcsigs"
+        res = remove_task_decorator(
+            python_source=py_source, task_decorator_name="@task.preexisting_virtualenv"
+        )
+        assert res == "@foo\n@bar\ndef f():\nimport funcsigs"

Review Comment:
   Right. I love that. It was just renamed (badly) file that I did not touch. But that's OK for me to change to pytest style. I actually learned to love Pytest style of test!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25780: Implement PythonOtherenvOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1229097113

   I am thinking that we should start voting at the devlist for that one :). 
   
   At the end this is the 2nd hardest of the 2 hardest problems in CS:
   
   > There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.
   
   😄 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] raphaelauv commented on pull request #25780: Implement PythonOtherenvOperator

Posted by GitBox <gi...@apache.org>.
raphaelauv commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1229263521

   Could it be possible that this operator become the new PythonVirtualenvOperator ?
   
   and keep by default the creation of a new venv ( like is already working the exisiting PythonVirtualenvOperator)
   
   meaning
   ```python
   class PythonVirtualenvOperator:
   
     def __ini__(  ...,
                         existingVirtualEnvPath:Optional[str] = None,
                         ....
   
   ```
   
   That way , things stay simple and explicit at the same time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1234173064

   Ready for final review I think !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1221325050

   I like the `PythonPreexistingVirtualenvOperator` name best


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r951889888


##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+

Review Comment:
   I added one more sentence to clarify (and rebased too).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] ashb commented on pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
ashb commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1228346742

   I'm also not 100% sold on "PreexistingVirtualEnv", I will ponder.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1228526187

   > I'm also not 100% sold on "PreexistingVirtualEnv", I will ponder.
   
   https://github.com/apache/airflow/pull/25780#issuecomment-1221510678 -> this contains explanation why I lioe PreexistingVirtualenv -> In short, while we know it will work with "non venv", I believe we really want to promote it as such and we want to make our users map their past knowledge about Virtualenv, and also it is a perfect counterpart to Virtualenv operator for the scenario of "Dag developing" wich I also explained in the doc.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1218689029

   I still have to add /fix tests. But for some strange reason I had to make a number of changes to our typing (MyPy complained) - those changes look rather reasonable but @uranusjr maybe you can take a look If I have not made some stupid mistake that led to it. 
   
   BTW. PythonExternalOperator seems like a good name overall


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] raphaelauv commented on pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
raphaelauv commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1219659591

   I think `ExternalPythonOperator` is a confusing name , that could give the impression to users that the task is going to run out of airflow itself ( it sounds stupid and magical , but so many airflow users are not aware of how things really works and where code is actually running depending the operator and executor )
   
   Why not add a parameter to PythonVirtualenvOperator giving the possibility to the user to set the path of an existing venv ?
   
   or name this new operator `PythonExistingVirtualenvOperator` ?
   
   thanks
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] o-nikolas commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
o-nikolas commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r949425679


##########
airflow/example_dags/example_python_operator.py:
##########
@@ -93,3 +93,28 @@ def callable_virtualenv():
 
         virtualenv_task = callable_virtualenv()
         # [END howto_operator_python_venv]
+
+        # [START howto_operator_external_python]
+        @task.external_python(task_id="virtualenv_python", python="/ven/bin/python")

Review Comment:
   Super-minor-nit: s/ven/venv/g



##########
airflow/example_dags/example_python_operator.py:
##########
@@ -93,3 +93,28 @@ def callable_virtualenv():
 
         virtualenv_task = callable_virtualenv()
         # [END howto_operator_python_venv]
+
+        # [START howto_operator_external_python]
+        @task.external_python(task_id="virtualenv_python", python="/ven/bin/python")
+        def callable_external_python():
+            """
+            Example function that will be performed in a virtual environment.
+
+            Importing at the module level ensures that it will not attempt to import the
+            library before it is installed.
+            """
+            from time import sleep
+
+            from colorama import Back, Fore, Style
+
+            print(Fore.RED + 'some red text')
+            print(Back.GREEN + 'and with a green background')
+            print(Style.DIM + 'and in dim text')
+            print(Style.RESET_ALL)
+            for _ in range(10):
+                print(Style.DIM + 'Please wait...', flush=True)
+                sleep(10)

Review Comment:
   I know this is just copy/pasted from above, but 100s seems very long for a task in a simple example dag.
   
   (Also speaking of copy/paste, maybe extract this into a helper so the code isn't duplicated in both tasks).



##########
docs/apache-airflow/tutorial_taskflow_api.rst:
##########
@@ -253,17 +253,21 @@ It is worth noting that the Python source code (extracted from the decorated fun
     You should upgrade to Airflow 2.2 or above in order to use it.
 
 If you don't want to run your image on a Docker environment, and instead want to create a separate virtual
-environment on the same machine, you can use the ``@task.virtualenv`` decorator instead. The ``@task.virtualenv``
-decorator will allow you to create a new virtualenv with custom libraries and even a different
-Python version to run your function.
+environment on the same machine, you can use the ``@task.virtualenv`` decorator or ``@task.external_python``
+instead. The ``@task.virtualenv`` decorator will allow you to create a new virtualenv with custom libraries
+and even a different Python version to run your function, similarly ``@task.external_python`` will allow you
+to run Airflow task in pre-defined, immutable virtualenv (which also could have different set of custom

Review Comment:
   ```suggestion
   to run an Airflow task in pre-defined, immutable virtualenv (which also could have a different set of custom
   ```



##########
docs/apache-airflow/howto/operator/python.rst:
##########
@@ -89,6 +89,36 @@ If additional parameters for package installation are needed pass them in ``requ
 All supported options are listed in the `requirements file format <https://pip.pypa.io/en/stable/reference/requirements-file-format/#supported-options>`_.
 
 
+.. _howto/operator:ExternalPythonOperator:
+
+ExternalPythonOperator
+======================
+
+The ExternalPythonOperator can help you to run some of your tasks with different set of Python

Review Comment:
   ```suggestion
   The ExternalPythonOperator can help you to run some of your tasks with a different set of Python
   ```



##########
docs/apache-airflow/howto/operator/python.rst:
##########
@@ -89,6 +89,36 @@ If additional parameters for package installation are needed pass them in ``requ
 All supported options are listed in the `requirements file format <https://pip.pypa.io/en/stable/reference/requirements-file-format/#supported-options>`_.
 
 
+.. _howto/operator:ExternalPythonOperator:
+
+ExternalPythonOperator
+======================
+
+The ExternalPythonOperator can help you to run some of your tasks with different set of Python
+libraries than other tasks (and than the main Airflow environment).
+
+Use the :class:`~airflow.operators.python.ExternalPythonOperator` to execute Python callables inside a
+pre-defined virtual environment. The virtualenv should be preinstalled in the environment where
+Python is run and in case ``dill`` is used, it has to be preinstalled in the virtualenv (the same
+version that is installed in main Airflow environment).
+
+.. exampleinclude:: /../../airflow/example_dags/example_python_operator.py
+    :language: python
+    :dedent: 4
+    :start-after: [START howto_operator_external_python]
+    :end-before: [END howto_operator_external_python]
+
+Passing in arguments
+^^^^^^^^^^^^^^^^^^^^
+
+You can use the ``op_args`` and ``op_kwargs`` arguments the same way you use it in the PythonOperator.

Review Comment:
   ```suggestion
   You can use the ``op_args`` and ``op_kwargs`` arguments the same way you use them in the PythonOperator.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r949024879


##########
airflow/decorators/base.py:
##########
@@ -165,7 +166,7 @@ def __init__(
         python_callable: Callable,
         task_id: str,
         op_args: Optional[Collection[Any]] = None,
-        op_kwargs: Optional[Mapping[str, Any]] = None,
+        op_kwargs: Optional[MutableMapping[str, Any]] = None,

Review Comment:
   Right. Fixed by copying ok_kwargs before mutating it .



##########
airflow/decorators/base.py:
##########
@@ -165,7 +166,7 @@ def __init__(
         python_callable: Callable,
         task_id: str,
         op_args: Optional[Collection[Any]] = None,
-        op_kwargs: Optional[Mapping[str, Any]] = None,
+        op_kwargs: Optional[MutableMapping[str, Any]] = None,

Review Comment:
   Right. Fixed by copying op_kwargs before mutating it .



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] mik-laj commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
mik-laj commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r961924081


##########
airflow/example_dags/example_python_operator.py:
##########
@@ -32,6 +36,20 @@
 
 log = logging.getLogger(__name__)
 
+PYTHON = sys.executable
+
+BASE_DIR = tempfile.gettempdir()
+EXTERNAL_PYTHON_ENV = Path(BASE_DIR, "venv-for-system-tests")
+
+# [START howto_initial_operator_external_python]
+
+EXTERNAL_PYTHON_PATH = EXTERNAL_PYTHON_ENV / "bin" / "python"
+
+# [END howto_initial_operator_external_python]
+
+if not EXTERNAL_PYTHON_PATH.exists():
+    venv.create(EXTERNAL_PYTHON_ENV)

Review Comment:
   This means that the virtual environment will also be created when the user only executes the command.
   ```
   AIRFLOW__CORE__LOAD_EXAMPLES=true airflow dags list
   ```
   I don't think this is expected and we should run it as a separate task.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r961969631


##########
airflow/example_dags/example_python_operator.py:
##########
@@ -32,6 +36,20 @@
 
 log = logging.getLogger(__name__)
 
+PYTHON = sys.executable
+
+BASE_DIR = tempfile.gettempdir()
+EXTERNAL_PYTHON_ENV = Path(BASE_DIR, "venv-for-system-tests")
+
+# [START howto_initial_operator_external_python]
+
+EXTERNAL_PYTHON_PATH = EXTERNAL_PYTHON_ENV / "bin" / "python"
+
+# [END howto_initial_operator_external_python]
+
+if not EXTERNAL_PYTHON_PATH.exists():
+    venv.create(EXTERNAL_PYTHON_ENV)

Review Comment:
   Good point
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1228889456

   I think I addressed all comments (thanks for VERY THOROUGH docs review @o-nikolas and @ashb ). 
   
   I think I figured out a good proposal that fits very well both - requirements of being similar to virtualenv and being "correct" in terms of not having to use virtualenv.
   
   My proposal (and it's already updated in the PR) is:
   
   * PythonOtherenvOperator
   * @task.otherenv decorator
   
   I think this addresses all the concerns, it is short, easy to remember and use and also has very close resemblance to PythonVirtualenvOperator to show that it is closer to it than to PythonOperator and it does not imply Virtualenv.
   
   Few doubts I had (and I made some choices that could be changed still):
   
   * PythonOtherEnvOperator vs PythonOtherenvOperator -> I think the latter is better even if slightly less "correct" - it also matches well the decorator (we have no casing in decorator by convention)
   * @task.python_otherenv vs. @task.otherenv  -> I think the latter is better: shorter and more close to @task.virtualenv too.
   * still there is one reference to virtualenv - we have ``virtualenv_string_args`` still created as 'global' variable accessible in the task. Changing it would be backwards-incompatible, and I think it's not worth to handle it differently.
   
   Let me know what you think @o-nikolas @uranusjr @ashb. Does it look `good-enough` for all of you :)
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] ashb commented on a diff in pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
ashb commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r955896696


##########
docs/apache-airflow/howto/operator/python.rst:
##########
@@ -89,6 +89,36 @@ If additional parameters for package installation are needed pass them in ``requ
 All supported options are listed in the `requirements file format <https://pip.pypa.io/en/stable/reference/requirements-file-format/#supported-options>`_.
 
 
+.. _howto/operator:PythonPreexistingVirtualenvOperator:
+
+PythonPreexistingVirtualenvOperator
+===================================
+
+The PythonPreexistingVirtualenvOperator can help you to run some of your tasks with a different set of Python
+libraries than other tasks (and than the main Airflow environment).
+
+Use the :class:`~airflow.operators.python.PythonPreexistingVirtualenvOperator` to execute Python callables inside a
+pre-defined virtual environment. The virtualenv should be preinstalled in the environment where
+Python is run and in case ``dill`` is used, it has to be preinstalled in the virtualenv (the same
+version that is installed in main Airflow environment).
+
+.. exampleinclude:: /../../airflow/example_dags/example_python_operator.py
+    :language: python
+    :dedent: 4
+    :start-after: [START howto_operator_preexisting_virtualenv]
+    :end-before: [END howto_operator_preexisting_virtualenv]
+
+Passing in arguments
+^^^^^^^^^^^^^^^^^^^^
+
+You can use the ``op_args`` and ``op_kwargs`` arguments the same way you use it in the PythonOperator.

Review Comment:
   ```suggestion
   You can use the ``op_args`` and ``op_kwargs`` arguments the same way you use it with the PythonOperator.
   ```



##########
docs/apache-airflow/howto/operator/python.rst:
##########
@@ -89,6 +89,36 @@ If additional parameters for package installation are needed pass them in ``requ
 All supported options are listed in the `requirements file format <https://pip.pypa.io/en/stable/reference/requirements-file-format/#supported-options>`_.
 
 
+.. _howto/operator:PythonPreexistingVirtualenvOperator:
+
+PythonPreexistingVirtualenvOperator
+===================================
+
+The PythonPreexistingVirtualenvOperator can help you to run some of your tasks with a different set of Python
+libraries than other tasks (and than the main Airflow environment).
+
+Use the :class:`~airflow.operators.python.PythonPreexistingVirtualenvOperator` to execute Python callables inside a
+pre-defined virtual environment. The virtualenv should be preinstalled in the environment where
+Python is run and in case ``dill`` is used, it has to be preinstalled in the virtualenv (the same

Review Comment:
   ```suggestion
   Python is run and in case ``dill`` is used, it has to be installed in the virtualenv (the same
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and their constraints do not limit you so the chance of a conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where the dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As a DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only knowledge of Python requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by this operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in the Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository)
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c

Review Comment:
   ```suggestion
     same worker might be affected by previous tasks creating/modifying files etc.
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and their constraints do not limit you so the chance of a conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where the dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As a DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only knowledge of Python requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by this operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in the Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository)
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PythonPreexistingVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PythonPreexistingVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable.
+The ``immutable`` in this context means that (unlike in :class:`airflow.operators.python.PythonVirtualenvOperator`)
+you cannot add new dependencies to such pre-existing virtualenv. All dependencies you need should be added
+upfront in your environment (and available in all the workers in case your Airflow runs in a distributed
+environment). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installations those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.

Review Comment:
   What resources? I don't understand that final bit.



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and their constraints do not limit you so the chance of a conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where the dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As a DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only knowledge of Python requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by this operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in the Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository)
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PythonPreexistingVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PythonPreexistingVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable.
+The ``immutable`` in this context means that (unlike in :class:`airflow.operators.python.PythonVirtualenvOperator`)
+you cannot add new dependencies to such pre-existing virtualenv. All dependencies you need should be added
+upfront in your environment (and available in all the workers in case your Airflow runs in a distributed
+environment). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installations those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.

Review Comment:
   ```suggestion
     be added at task run time. This is good for both security and stability.
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and their constraints do not limit you so the chance of a conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where the dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As a DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only knowledge of Python requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by this operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in the Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository)
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PythonPreexistingVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PythonPreexistingVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable.
+The ``immutable`` in this context means that (unlike in :class:`airflow.operators.python.PythonVirtualenvOperator`)
+you cannot add new dependencies to such pre-existing virtualenv. All dependencies you need should be added
+upfront in your environment (and available in all the workers in case your Airflow runs in a distributed
+environment). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installations those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the fly, adding new or changing requirements require at least an Airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and ``PythonPreexistingVirtualenvOperator``
+as counterparts - as a DAG author you'd normally iterate with dependencies and develop your DAG using
+``PythonVirtualenvOperator`` (thus decorating your tasks with ``@task.virtualenv`` decorators) while
+after the iteration and changes you would likely want to change it for production to switch to
+the ``PythonPreexistingVirtualenvOperator`` after your DevOps/System Admin teams deploy your new
+virtualenv to production. The nice thing about this is that you can switch the decorator back
+at any time and continue developing it "dynamically" with ``PythonVirtualenvOperator``.
+
+
+Using DockerOperator or Kubernetes Pod Operator
+-----------------------------------------------
+
+Another strategy is to use the Docker Operator or the Kubernetes Pod Operator. Those require that Airflow runs in a
+Docker container environment or Kubernetes environment (or at the very least have access to create and
+run tasks with those).
+
+Similarly as in case of Python operators, the taskflow decorators are handy for you if you would like to
+use those operators to execute your callable Python code.
+
+However, it is far more involved - you need to understand how Docker/Kubernetes Pods work if you want to use
+this approach, but the tasks are fully isolated from each other and you are not even limited to running
+Python code. You can write your tasks in any Programming language you want. Also your dependencies are
+fully independent from Airflow ones (including the system level dependencies) so if your task require
+a very different environment, this is the way to go. Those are ``@task.docker`` and ``@task.kubernetes``
+decorators.

Review Comment:
   ```suggestion
   Similarly as in case of Python operators, the taskflow decorators ``@task.docker`` and ``@task.kubernets``` are handy for you if you would like to
   use those operators to execute your callable Python code.
   
   However, it is far more involved - you need to understand how Docker/Kubernetes Pods work if you want to use
   this approach, but the tasks are fully isolated from each other and you are not even limited to running
   Python code. You can write your tasks in any Programming language you want. Also your dependencies are
   fully independent from Airflow ones (including the system level dependencies) so if your task require
   a very different environment, this is the way to go.
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and their constraints do not limit you so the chance of a conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where the dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As a DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only knowledge of Python requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by this operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in the Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository)
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PythonPreexistingVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PythonPreexistingVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable.
+The ``immutable`` in this context means that (unlike in :class:`airflow.operators.python.PythonVirtualenvOperator`)
+you cannot add new dependencies to such pre-existing virtualenv. All dependencies you need should be added
+upfront in your environment (and available in all the workers in case your Airflow runs in a distributed
+environment). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installations those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the fly, adding new or changing requirements require at least an Airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and ``PythonPreexistingVirtualenvOperator``
+as counterparts - as a DAG author you'd normally iterate with dependencies and develop your DAG using
+``PythonVirtualenvOperator`` (thus decorating your tasks with ``@task.virtualenv`` decorators) while
+after the iteration and changes you would likely want to change it for production to switch to
+the ``PythonPreexistingVirtualenvOperator`` after your DevOps/System Admin teams deploy your new
+virtualenv to production. The nice thing about this is that you can switch the decorator back
+at any time and continue developing it "dynamically" with ``PythonVirtualenvOperator``.
+
+
+Using DockerOperator or Kubernetes Pod Operator
+-----------------------------------------------
+
+Another strategy is to use the Docker Operator or the Kubernetes Pod Operator. Those require that Airflow runs in a
+Docker container environment or Kubernetes environment (or at the very least have access to create and
+run tasks with those).
+
+Similarly as in case of Python operators, the taskflow decorators are handy for you if you would like to
+use those operators to execute your callable Python code.
+
+However, it is far more involved - you need to understand how Docker/Kubernetes Pods work if you want to use
+this approach, but the tasks are fully isolated from each other and you are not even limited to running
+Python code. You can write your tasks in any Programming language you want. Also your dependencies are
+fully independent from Airflow ones (including the system level dependencies) so if your task require
+a very different environment, this is the way to go. Those are ``@task.docker`` and ``@task.kubernetes``
+decorators.
+
+The benefits of those operators are:

Review Comment:
   ```suggestion
   The benefits of these operators are:
   ```



##########
docs/apache-airflow/howto/operator/python.rst:
##########
@@ -89,6 +89,36 @@ If additional parameters for package installation are needed pass them in ``requ
 All supported options are listed in the `requirements file format <https://pip.pypa.io/en/stable/reference/requirements-file-format/#supported-options>`_.
 
 
+.. _howto/operator:PythonPreexistingVirtualenvOperator:
+
+PythonPreexistingVirtualenvOperator
+===================================
+
+The PythonPreexistingVirtualenvOperator can help you to run some of your tasks with a different set of Python
+libraries than other tasks (and than the main Airflow environment).
+
+Use the :class:`~airflow.operators.python.PythonPreexistingVirtualenvOperator` to execute Python callables inside a
+pre-defined virtual environment. The virtualenv should be preinstalled in the environment where
+Python is run and in case ``dill`` is used, it has to be preinstalled in the virtualenv (the same
+version that is installed in main Airflow environment).
+
+.. exampleinclude:: /../../airflow/example_dags/example_python_operator.py
+    :language: python
+    :dedent: 4
+    :start-after: [START howto_operator_preexisting_virtualenv]
+    :end-before: [END howto_operator_preexisting_virtualenv]
+
+Passing in arguments
+^^^^^^^^^^^^^^^^^^^^
+
+You can use the ``op_args`` and ``op_kwargs`` arguments the same way you use it in the PythonOperator.
+Unfortunately we currently do not support to serialize ``var`` and ``ti`` / ``task_instance`` due to incompatibilities
+with the underlying library. For Airflow context variables make sure that Airflow is also installed as part

Review Comment:
   What does this mean? I can't quite understand it.



##########
docs/apache-airflow/howto/operator/python.rst:
##########
@@ -89,6 +89,36 @@ If additional parameters for package installation are needed pass them in ``requ
 All supported options are listed in the `requirements file format <https://pip.pypa.io/en/stable/reference/requirements-file-format/#supported-options>`_.
 
 
+.. _howto/operator:PythonPreexistingVirtualenvOperator:
+
+PythonPreexistingVirtualenvOperator
+===================================
+
+The PythonPreexistingVirtualenvOperator can help you to run some of your tasks with a different set of Python
+libraries than other tasks (and than the main Airflow environment).
+
+Use the :class:`~airflow.operators.python.PythonPreexistingVirtualenvOperator` to execute Python callables inside a
+pre-defined virtual environment. The virtualenv should be preinstalled in the environment where
+Python is run and in case ``dill`` is used, it has to be preinstalled in the virtualenv (the same
+version that is installed in main Airflow environment).
+
+.. exampleinclude:: /../../airflow/example_dags/example_python_operator.py
+    :language: python
+    :dedent: 4
+    :start-after: [START howto_operator_preexisting_virtualenv]
+    :end-before: [END howto_operator_preexisting_virtualenv]
+
+Passing in arguments
+^^^^^^^^^^^^^^^^^^^^
+
+You can use the ``op_args`` and ``op_kwargs`` arguments the same way you use it in the PythonOperator.
+Unfortunately we currently do not support to serialize ``var`` and ``ti`` / ``task_instance`` due to incompatibilities
+with the underlying library. For Airflow context variables make sure that Airflow is also installed as part
+of the virtualenv environment in the same version as the Airflow version the task is run on.
+Otherwise you won't have access to the most context variables of Airflow in ``op_kwargs``.
+If you want the context related to datetime objects like ``data_interval_start`` you can add ``pendulum`` and
+``lazy_object_proxy`` to your virtualenv.

Review Comment:
   Hmmm, what is lazy_object_proxy needed for? This one feels like it shouldn't be required for "most" users.



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and their constraints do not limit you so the chance of a conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write your own Custom Operator,

Review Comment:
   I don't like this framing -- it paints pre-packaged operators as legacy and to-be-avoided, which they are not. Please reword this.



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.

Review Comment:
   ```suggestion
   and the dependencies conflict between those tasks.
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and their constraints do not limit you so the chance of a conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where the dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).

Review Comment:
   I don't think the point here about memory being reused is true -- since it's a new process each new venv has it's own copy of python and modules loaded -- nothing is shared.



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and their constraints do not limit you so the chance of a conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where the dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As a DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only knowledge of Python requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by this operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in the Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository)
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PythonPreexistingVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PythonPreexistingVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable.
+The ``immutable`` in this context means that (unlike in :class:`airflow.operators.python.PythonVirtualenvOperator`)
+you cannot add new dependencies to such pre-existing virtualenv. All dependencies you need should be added

Review Comment:
   ```suggestion
   you cannot add new dependencies from the task to such a pre-existing virtualenv. All dependencies you need should be added
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and their constraints do not limit you so the chance of a conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where the dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As a DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only knowledge of Python requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by this operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in the Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository)
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PythonPreexistingVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the

Review Comment:
   ```suggestion
   A bit more complex but with significantly less overhead, security, and stability problems is to use the
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and their constraints do not limit you so the chance of a conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where the dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As a DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only knowledge of Python requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by this operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in the Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository)
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PythonPreexistingVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PythonPreexistingVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable.
+The ``immutable`` in this context means that (unlike in :class:`airflow.operators.python.PythonVirtualenvOperator`)
+you cannot add new dependencies to such pre-existing virtualenv. All dependencies you need should be added
+upfront in your environment (and available in all the workers in case your Airflow runs in a distributed
+environment). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installations those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only knowledge of Python, requirements

Review Comment:
   ```suggestion
   * No need to learn more about containers or Kubernetes as a DAG Author. Only knowledge of Python and requirements
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and their constraints do not limit you so the chance of a conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where the dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As a DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only knowledge of Python requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by this operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in the Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository)
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PythonPreexistingVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PythonPreexistingVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable.
+The ``immutable`` in this context means that (unlike in :class:`airflow.operators.python.PythonVirtualenvOperator`)
+you cannot add new dependencies to such pre-existing virtualenv. All dependencies you need should be added
+upfront in your environment (and available in all the workers in case your Airflow runs in a distributed
+environment). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installations those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the fly, adding new or changing requirements require at least an Airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c

Review Comment:
   ```suggestion
     same worker might be affected by previous tasks creating/modifying files etc.
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and their constraints do not limit you so the chance of a conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where the dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As a DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only knowledge of Python requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by this operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in the Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository)
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PythonPreexistingVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PythonPreexistingVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable.
+The ``immutable`` in this context means that (unlike in :class:`airflow.operators.python.PythonVirtualenvOperator`)
+you cannot add new dependencies to such pre-existing virtualenv. All dependencies you need should be added
+upfront in your environment (and available in all the workers in case your Airflow runs in a distributed
+environment). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installations those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.

Review Comment:
   ```suggestion
   * There is no need to have access by workers to PyPI or private repositories; less chance for transient
     errors resulting from networking glitches.
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and their constraints do not limit you so the chance of a conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where the dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As a DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only knowledge of Python requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by this operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in the Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository)
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PythonPreexistingVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PythonPreexistingVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable.
+The ``immutable`` in this context means that (unlike in :class:`airflow.operators.python.PythonVirtualenvOperator`)
+you cannot add new dependencies to such pre-existing virtualenv. All dependencies you need should be added
+upfront in your environment (and available in all the workers in case your Airflow runs in a distributed
+environment). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installations those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the fly, adding new or changing requirements require at least an Airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and ``PythonPreexistingVirtualenvOperator``

Review Comment:
   ```suggestion
   You can think about the ``PythonVirtualenvOperator`` and ``PythonPreexistingVirtualenvOperator``
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and their constraints do not limit you so the chance of a conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where the dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As a DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only knowledge of Python requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by this operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in the Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository)
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PythonPreexistingVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PythonPreexistingVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable.
+The ``immutable`` in this context means that (unlike in :class:`airflow.operators.python.PythonVirtualenvOperator`)
+you cannot add new dependencies to such pre-existing virtualenv. All dependencies you need should be added
+upfront in your environment (and available in all the workers in case your Airflow runs in a distributed
+environment). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installations those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the fly, adding new or changing requirements require at least an Airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and ``PythonPreexistingVirtualenvOperator``
+as counterparts - as a DAG author you'd normally iterate with dependencies and develop your DAG using
+``PythonVirtualenvOperator`` (thus decorating your tasks with ``@task.virtualenv`` decorators) while
+after the iteration and changes you would likely want to change it for production to switch to
+the ``PythonPreexistingVirtualenvOperator`` after your DevOps/System Admin teams deploy your new
+virtualenv to production. The nice thing about this is that you can switch the decorator back
+at any time and continue developing it "dynamically" with ``PythonVirtualenvOperator``.
+
+
+Using DockerOperator or Kubernetes Pod Operator
+-----------------------------------------------
+
+Another strategy is to use the Docker Operator or the Kubernetes Pod Operator. Those require that Airflow runs in a
+Docker container environment or Kubernetes environment (or at the very least have access to create and
+run tasks with those).
+
+Similarly as in case of Python operators, the taskflow decorators are handy for you if you would like to
+use those operators to execute your callable Python code.
+
+However, it is far more involved - you need to understand how Docker/Kubernetes Pods work if you want to use
+this approach, but the tasks are fully isolated from each other and you are not even limited to running
+Python code. You can write your tasks in any Programming language you want. Also your dependencies are
+fully independent from Airflow ones (including the system level dependencies) so if your task require
+a very different environment, this is the way to go. Those are ``@task.docker`` and ``@task.kubernetes``
+decorators.
+
+The benefits of those operators are:
+
+* You can run tasks with different sets of both Python and system level dependencies, or even tasks
+  written in completely different language or even different processor architecture (x86 vs. arm).
+* The environment used to run the tasks enjoys the optimizations and immutability of containers, where a
+  similar set of dependencies can effectively reuse a number of cached layers of the image, so the
+  environment is optimized for the case where you have multiple similar, but different environments.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.

Review Comment:
   ```suggestion
     be added at task runtime. This is good for both security and stability.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1222352127

   Rebased after #25864 @uranusjr 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r951846264


##########
docs/apache-airflow/best-practices.rst:
##########
@@ -20,10 +20,11 @@
 Best Practices
 ==============
 
-Creating a new DAG is a two-step process:
+Creating a new DAG is a three-step process:
 
 - writing Python code to create a DAG object,
-- testing if the code meets our expectations
+- testing if the code meets our expectations,
+- running the DAG in production

Review Comment:
   Yeah. It did not feel right to me either :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r956388525


##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and their constraints do not limit you so the chance of a conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where the dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).

Review Comment:
   Right. It was a mental shortuct. I meant that you do not have to run multiple workers to handle multiple environments. So memory saving was from not essentially duplicating memory for running more workers. I reworded it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r963008083


##########
airflow/operators/python.py:
##########
@@ -501,27 +555,152 @@ def _iter_serializable_context_keys(self):
         elif 'pendulum' in self.requirements:
             yield from self.PENDULUM_SERIALIZABLE_CONTEXT_KEYS
 
-    def _write_string_args(self, filename):
-        with open(filename, 'w') as file:
-            file.write('\n'.join(map(str, self.string_args)))
 
-    def _read_result(self, filename):
-        if os.stat(filename).st_size == 0:
-            return None
-        with open(filename, 'rb') as file:
-            try:
-                return self.pickling_library.load(file)
-            except ValueError:
-                self.log.error(
-                    "Error deserializing result. Note that result deserialization "
-                    "is not supported across major Python versions."
+class ExternalPythonOperator(_BasePythonVirtualenvOperator):
+    """
+    Allows one to run a function in a virtualenv that is not re-created but used as is
+    without the overhead of creating the virtualenv (with certain caveats).
+
+    The function must be defined using def, and not be
+    part of a class. All imports must happen inside the function
+    and no variables outside the scope may be referenced. A global scope
+    variable named virtualenv_string_args will be available (populated by
+    string_args). In addition, one can pass stuff through op_args and op_kwargs, and one
+    can use a return value.
+    Note that if your virtualenv runs in a different Python major version than Airflow,
+    you cannot use return values, op_args, op_kwargs, or use any macros that are being provided to
+    Airflow through plugins. You can use string_args though.
+
+    .. seealso::
+        For more information on how to use this operator, take a look at the guide:
+        :ref:`howto/operator:ExternalPythonOperator`
+
+    :param python: Full path string (file-system specific) that points to a Python binary inside
+        a virtualenv that should be used (in ``VENV/bin`` folder). Should be absolute path
+        (so usually start with "/" or "X:/" depending on the filesystem/os used).
+    :param python_callable: A python function with no references to outside variables,
+        defined with def, which will be run in a virtualenv
+    :param use_dill: Whether to use dill to serialize
+        the args and result (pickle is default). This allow more complex types
+        but if dill is not preinstalled in your venv, the task will fail with use_dill enabled.
+    :param op_args: A list of positional arguments to pass to python_callable.
+    :param op_kwargs: A dict of keyword arguments to pass to python_callable.
+    :param string_args: Strings that are present in the global var virtualenv_string_args,
+        available to python_callable at runtime as a list[str]. Note that args are split
+        by newline.
+    :param templates_dict: a dictionary where the values are templates that
+        will get templated by the Airflow engine sometime between
+        ``__init__`` and ``execute`` takes place and are made available
+        in your callable's context after the template has been applied
+    :param templates_exts: a list of file extensions to resolve while
+        processing templated fields, for examples ``['.sql', '.hql']``
+    """
+
+    template_fields: Sequence[str] = tuple({'python_path'} | set(PythonOperator.template_fields))
+
+    def __init__(
+        self,
+        *,
+        python: str,
+        python_callable: Callable,
+        use_dill: bool = False,
+        op_args: Optional[Collection[Any]] = None,
+        op_kwargs: Optional[Mapping[str, Any]] = None,
+        string_args: Optional[Iterable[str]] = None,
+        templates_dict: Optional[Dict] = None,
+        templates_exts: Optional[List[str]] = None,
+        **kwargs,
+    ):
+        if not python:
+            raise ValueError("Python Path must be defined in ExternalPythonOperator")
+        self.python = python
+        super().__init__(
+            python_callable=python_callable,
+            use_dill=use_dill,
+            op_args=op_args,
+            op_kwargs=op_kwargs,
+            string_args=string_args,
+            templates_dict=templates_dict,
+            templates_exts=templates_exts,
+            **kwargs,
+        )
+
+    def execute_callable(self):
+        python_path = Path(self.python)
+        if not python_path.exists():
+            raise ValueError(f"Python Path '{python_path}' must exists")
+        if not python_path.is_file():
+            raise ValueError(f"Python Path '{python_path}' must be a file")
+        if not python_path.is_absolute():
+            raise ValueError(f"Python Path '{python_path}' must be an absolute path.")
+        python_version_as_list_of_strings = self._get_python_version_from_venv()
+        if (
+            python_version_as_list_of_strings
+            and str(python_version_as_list_of_strings[0]) != str(sys.version_info.major)
+            and (self.op_args or self.op_kwargs)
+        ):
+            raise AirflowException(
+                "Passing op_args or op_kwargs is not supported across different Python "
+                "major versions for ExternalPythonOperator. Please use string_args."
+                f"Sys version: {sys.version_info}. Venv version: {python_version_as_list_of_strings}"
+            )
+        with TemporaryDirectory(prefix='tmd') as tmp_dir:
+            tmp_path = Path(tmp_dir)
+            return self._execute_python_callable_in_subprocess(python_path, tmp_path)
+
+    def _get_virtualenv_path(self) -> Path:
+        return Path(self.python).parents[1]
+
+    def _get_python_version_from_venv(self) -> List[str]:
+        try:
+            result = subprocess.check_output([self.python, "--version"], text=True)
+            return result.strip().split(" ")[-1].split(".")
+        except Exception as e:
+            raise ValueError(f"Error while executing {self.python}: {e}")
+
+    def _get_airflow_version_from_venv(self) -> Optional[str]:
+        try:
+            result = subprocess.check_output(
+                [self.python, "-c", "from airflow import version; print(version.version)"], text=True
+            )
+            venv_airflow_version = result.strip()
+            if venv_airflow_version != airflow_version:
+                raise AirflowConfigException(
+                    f"The version of Airflow installed in the virtualenv {self._get_virtualenv_path()}: "
+                    f"{venv_airflow_version} is different than the runtime Airflow version: "
+                    f"{airflow_version}. Make sure your environment has the same Airflow version "
+                    f"installed as the Airflow runtime."
                 )
-                raise
+            return venv_airflow_version
+        except Exception as e:
+            self.log.info("When checking for Airflow installed in venv got %s", e)
+            self.log.info(
+                f"This means that Airflow is not properly installed in the virtualenv "
+                f"{self._get_virtualenv_path()}. Airflow context keys will not be available. "
+                f"Please Install Airflow {airflow_version} in your venv to access them."
+            )

Review Comment:
   Good point. Actually DBT was one of the main reasons one of my customers is so much excited about it. I will change it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r963061072


##########
airflow/operators/python.py:
##########
@@ -501,27 +555,152 @@ def _iter_serializable_context_keys(self):
         elif 'pendulum' in self.requirements:
             yield from self.PENDULUM_SERIALIZABLE_CONTEXT_KEYS
 
-    def _write_string_args(self, filename):
-        with open(filename, 'w') as file:
-            file.write('\n'.join(map(str, self.string_args)))
 
-    def _read_result(self, filename):
-        if os.stat(filename).st_size == 0:
-            return None
-        with open(filename, 'rb') as file:
-            try:
-                return self.pickling_library.load(file)
-            except ValueError:
-                self.log.error(
-                    "Error deserializing result. Note that result deserialization "
-                    "is not supported across major Python versions."
+class ExternalPythonOperator(_BasePythonVirtualenvOperator):
+    """
+    Allows one to run a function in a virtualenv that is not re-created but used as is
+    without the overhead of creating the virtualenv (with certain caveats).
+
+    The function must be defined using def, and not be
+    part of a class. All imports must happen inside the function
+    and no variables outside the scope may be referenced. A global scope
+    variable named virtualenv_string_args will be available (populated by
+    string_args). In addition, one can pass stuff through op_args and op_kwargs, and one
+    can use a return value.
+    Note that if your virtualenv runs in a different Python major version than Airflow,
+    you cannot use return values, op_args, op_kwargs, or use any macros that are being provided to
+    Airflow through plugins. You can use string_args though.
+
+    .. seealso::
+        For more information on how to use this operator, take a look at the guide:
+        :ref:`howto/operator:ExternalPythonOperator`
+
+    :param python: Full path string (file-system specific) that points to a Python binary inside
+        a virtualenv that should be used (in ``VENV/bin`` folder). Should be absolute path
+        (so usually start with "/" or "X:/" depending on the filesystem/os used).
+    :param python_callable: A python function with no references to outside variables,
+        defined with def, which will be run in a virtualenv
+    :param use_dill: Whether to use dill to serialize
+        the args and result (pickle is default). This allow more complex types
+        but if dill is not preinstalled in your venv, the task will fail with use_dill enabled.
+    :param op_args: A list of positional arguments to pass to python_callable.
+    :param op_kwargs: A dict of keyword arguments to pass to python_callable.
+    :param string_args: Strings that are present in the global var virtualenv_string_args,
+        available to python_callable at runtime as a list[str]. Note that args are split
+        by newline.
+    :param templates_dict: a dictionary where the values are templates that
+        will get templated by the Airflow engine sometime between
+        ``__init__`` and ``execute`` takes place and are made available
+        in your callable's context after the template has been applied
+    :param templates_exts: a list of file extensions to resolve while
+        processing templated fields, for examples ``['.sql', '.hql']``
+    """
+
+    template_fields: Sequence[str] = tuple({'python_path'} | set(PythonOperator.template_fields))
+
+    def __init__(
+        self,
+        *,
+        python: str,
+        python_callable: Callable,
+        use_dill: bool = False,
+        op_args: Optional[Collection[Any]] = None,
+        op_kwargs: Optional[Mapping[str, Any]] = None,
+        string_args: Optional[Iterable[str]] = None,
+        templates_dict: Optional[Dict] = None,
+        templates_exts: Optional[List[str]] = None,
+        **kwargs,
+    ):
+        if not python:
+            raise ValueError("Python Path must be defined in ExternalPythonOperator")
+        self.python = python
+        super().__init__(
+            python_callable=python_callable,
+            use_dill=use_dill,
+            op_args=op_args,
+            op_kwargs=op_kwargs,
+            string_args=string_args,
+            templates_dict=templates_dict,
+            templates_exts=templates_exts,
+            **kwargs,
+        )
+
+    def execute_callable(self):
+        python_path = Path(self.python)
+        if not python_path.exists():
+            raise ValueError(f"Python Path '{python_path}' must exists")
+        if not python_path.is_file():
+            raise ValueError(f"Python Path '{python_path}' must be a file")
+        if not python_path.is_absolute():
+            raise ValueError(f"Python Path '{python_path}' must be an absolute path.")
+        python_version_as_list_of_strings = self._get_python_version_from_venv()
+        if (
+            python_version_as_list_of_strings
+            and str(python_version_as_list_of_strings[0]) != str(sys.version_info.major)
+            and (self.op_args or self.op_kwargs)
+        ):
+            raise AirflowException(
+                "Passing op_args or op_kwargs is not supported across different Python "
+                "major versions for ExternalPythonOperator. Please use string_args."
+                f"Sys version: {sys.version_info}. Venv version: {python_version_as_list_of_strings}"
+            )
+        with TemporaryDirectory(prefix='tmd') as tmp_dir:
+            tmp_path = Path(tmp_dir)
+            return self._execute_python_callable_in_subprocess(python_path, tmp_path)
+
+    def _get_virtualenv_path(self) -> Path:
+        return Path(self.python).parents[1]
+
+    def _get_python_version_from_venv(self) -> List[str]:
+        try:
+            result = subprocess.check_output([self.python, "--version"], text=True)
+            return result.strip().split(" ")[-1].split(".")
+        except Exception as e:
+            raise ValueError(f"Error while executing {self.python}: {e}")
+
+    def _get_airflow_version_from_venv(self) -> Optional[str]:
+        try:
+            result = subprocess.check_output(
+                [self.python, "-c", "from airflow import version; print(version.version)"], text=True
+            )
+            venv_airflow_version = result.strip()
+            if venv_airflow_version != airflow_version:
+                raise AirflowConfigException(
+                    f"The version of Airflow installed in the virtualenv {self._get_virtualenv_path()}: "
+                    f"{venv_airflow_version} is different than the runtime Airflow version: "
+                    f"{airflow_version}. Make sure your environment has the same Airflow version "
+                    f"installed as the Airflow runtime."
                 )
-                raise
+            return venv_airflow_version
+        except Exception as e:
+            self.log.info("When checking for Airflow installed in venv got %s", e)
+            self.log.info(
+                f"This means that Airflow is not properly installed in the virtualenv "
+                f"{self._get_virtualenv_path()}. Airflow context keys will not be available. "
+                f"Please Install Airflow {airflow_version} in your venv to access them."
+            )

Review Comment:
   The "expect_airflow" tunre out  also to be  useful for PyhonVirtualenv, because then it will skip attempting to import airflow plugins if not set (i slightly modified the script template). That seems to be more consistent (and I left default value to be true in both, but users can disable it) 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1219187952

   > The main implementation looks fine to me in general but there are too much peripheral changes (mapping changed to mutablemapping etc.) that don’t need to happen.
   
   Yeah. for some reason when I split - MyPy started to complain on those Mapping to be non Mutable - I just scraped it quickly and simply fixed MyPy, but yeah - I agree I need to find out why MyPy started to complain in the first place.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r950763784


##########
airflow/providers/docker/decorators/docker.py:
##########
@@ -27,7 +27,8 @@
 
 from airflow.decorators.base import DecoratedOperator, task_decorator_factory
 from airflow.providers.docker.operators.docker import DockerOperator
-from airflow.utils.python_virtualenv import remove_task_decorator, write_python_script
+from airflow.utils.decorators import remove_task_decorator

Review Comment:
   Ah. i see now - it's because of back-compat. for 2.2 and 2.3. It's fixed now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] eladkal commented on pull request #25780: Implement PythonOtherenvOperator

Posted by GitBox <gi...@apache.org>.
eladkal commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1229241063

   > I am thinking that we should start voting at the devlist for that one :).
   
   In the `DummyOperator`/`EmptyOperator`/other names kakil raised an unofficial pool in linkedin/twitter/slack just to see what the community thinks.
   
   What about `PythonConnectEnvOperator` / `PythonUseEnvOperator` / `PythonRunEnvOperator`? It suggests that the Env is already defined elsewhere and user just utilize it.
   
   (Maybe it will have for the distinction if we also rename the current `PythonVirtualenvOperator` to `PythonCreateVirtualenvOperator`)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r956377404


##########
docs/apache-airflow/howto/operator/python.rst:
##########
@@ -89,6 +89,36 @@ If additional parameters for package installation are needed pass them in ``requ
 All supported options are listed in the `requirements file format <https://pip.pypa.io/en/stable/reference/requirements-file-format/#supported-options>`_.
 
 
+.. _howto/operator:PythonPreexistingVirtualenvOperator:
+
+PythonPreexistingVirtualenvOperator
+===================================
+
+The PythonPreexistingVirtualenvOperator can help you to run some of your tasks with a different set of Python
+libraries than other tasks (and than the main Airflow environment).
+
+Use the :class:`~airflow.operators.python.PythonPreexistingVirtualenvOperator` to execute Python callables inside a
+pre-defined virtual environment. The virtualenv should be preinstalled in the environment where
+Python is run and in case ``dill`` is used, it has to be preinstalled in the virtualenv (the same
+version that is installed in main Airflow environment).
+
+.. exampleinclude:: /../../airflow/example_dags/example_python_operator.py
+    :language: python
+    :dedent: 4
+    :start-after: [START howto_operator_preexisting_virtualenv]
+    :end-before: [END howto_operator_preexisting_virtualenv]
+
+Passing in arguments
+^^^^^^^^^^^^^^^^^^^^
+
+You can use the ``op_args`` and ``op_kwargs`` arguments the same way you use it in the PythonOperator.
+Unfortunately we currently do not support to serialize ``var`` and ``ti`` / ``task_instance`` due to incompatibilities
+with the underlying library. For Airflow context variables make sure that Airflow is also installed as part
+of the virtualenv environment in the same version as the Airflow version the task is run on.
+Otherwise you won't have access to the most context variables of Airflow in ``op_kwargs``.
+If you want the context related to datetime objects like ``data_interval_start`` you can add ``pendulum`` and
+``lazy_object_proxy`` to your virtualenv.

Review Comment:
   It is used to get Context resolved properly I believe. This is taken directly from virtualenv documentation and I think it still holds (@uranusjr ?).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r956380794


##########
docs/apache-airflow/howto/operator/python.rst:
##########
@@ -89,6 +89,36 @@ If additional parameters for package installation are needed pass them in ``requ
 All supported options are listed in the `requirements file format <https://pip.pypa.io/en/stable/reference/requirements-file-format/#supported-options>`_.
 
 
+.. _howto/operator:PythonPreexistingVirtualenvOperator:
+
+PythonPreexistingVirtualenvOperator
+===================================
+
+The PythonPreexistingVirtualenvOperator can help you to run some of your tasks with a different set of Python
+libraries than other tasks (and than the main Airflow environment).
+
+Use the :class:`~airflow.operators.python.PythonPreexistingVirtualenvOperator` to execute Python callables inside a
+pre-defined virtual environment. The virtualenv should be preinstalled in the environment where
+Python is run and in case ``dill`` is used, it has to be preinstalled in the virtualenv (the same
+version that is installed in main Airflow environment).
+
+.. exampleinclude:: /../../airflow/example_dags/example_python_operator.py
+    :language: python
+    :dedent: 4
+    :start-after: [START howto_operator_preexisting_virtualenv]
+    :end-before: [END howto_operator_preexisting_virtualenv]
+
+Passing in arguments
+^^^^^^^^^^^^^^^^^^^^
+
+You can use the ``op_args`` and ``op_kwargs`` arguments the same way you use it in the PythonOperator.
+Unfortunately we currently do not support to serialize ``var`` and ``ti`` / ``task_instance`` due to incompatibilities
+with the underlying library. For Airflow context variables make sure that Airflow is also installed as part

Review Comment:
   It's directly copied from the existing Python Virtualenv description but I can rephrase it in both places.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] o-nikolas commented on a diff in pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
o-nikolas commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r951796105


##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installation those are usually different

Review Comment:
   ```suggestion
   who manage Airflow installation need to be involved (and in bigger installations those are usually different
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with

Review Comment:
   ```suggestion
   which means that you have a "fixed" set of dependencies that the community guarantees that Airflow can be installed with
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have

Review Comment:
   ```suggestion
   independently and their constraints do not limit you so the chance of a conflicting dependency is lower (you still have
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installation those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements

Review Comment:
   ```suggestion
   * No need to learn more about containers, Kubernetes as a DAG Author. Only knowledge of Python, requirements
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installation those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and ``PreexistingPythonVirtualenvOperator``
+as counterparts - as DAG author you'd normally iterate with dependencies and develop your DAG using

Review Comment:
   ```suggestion
   as counterparts - as a DAG author you'd normally iterate with dependencies and develop your DAG using
   ```



##########
airflow/decorators/__init__.pyi:
##########
@@ -123,6 +125,37 @@ class TaskDecoratorCollection:
         """
     @overload
     def virtualenv(self, python_callable: Callable[FParams, FReturn]) -> Task[FParams, FReturn]: ...
+    def preexisting_virtualenv(
+        self,
+        *,
+        python: str,
+        multiple_outputs: Optional[bool] = None,
+        # 'python_callable', 'op_args' and 'op_kwargs' since they are filled by
+        # _PythonVirtualenvDecoratedOperator.
+        use_dill: bool = False,
+        templates_dict: Optional[Mapping[str, Any]] = None,
+        show_return_value_in_logs: bool = True,
+        **kwargs,
+    ) -> TaskDecorator:
+        """Create a decorator to convert the decorated callable to a virtual environment task.
+
+        :param python: Full time path string (file-system specific) that points to a Python binary inside

Review Comment:
   ```suggestion
           :param python: Full path string (file-system specific) that points to a Python binary inside
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements

Review Comment:
   ```suggestion
   * No need to learn more about containers, Kubernetes as a DAG Author. Only knowledge of Python requirements
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:

Review Comment:
   ```suggestion
   There are certain limitations and overhead introduced by this operator:
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you

Review Comment:
   ```suggestion
   * All dependencies that are not available in the Airflow environment must be locally imported in the callable you
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository

Review Comment:
   ```suggestion
     or when there is a networking issue with reaching the repository)
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:

Review Comment:
   ```suggestion
   The operator takes care of:
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.

Review Comment:
   This is a very important point!



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your

Review Comment:
   What do you specifically mean by "It requires however that the virtualenv you use is immutable by the task"?



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead and problems of re-creating the

Review Comment:
   ```suggestion
   Airflow runs in a distributed environment). This way you avoid the overhead and problems of re-creating the
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -20,10 +20,11 @@
 Best Practices
 ==============
 
-Creating a new DAG is a two-step process:
+Creating a new DAG is a three-step process:
 
 - writing Python code to create a DAG object,
-- testing if the code meets our expectations
+- testing if the code meets our expectations,
+- running the DAG in production
 
 This tutorial will introduce you to the best practices for these two steps.

Review Comment:
   ```suggestion
   This tutorial will introduce you to the best practices for these three steps.
   ```
   
   Are there other spots in the text below as well that need to be adapted now that there are three points?



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installation those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and ``PreexistingPythonVirtualenvOperator``

Review Comment:
   ```suggestion
   Actually, you can think about the ``PythonVirtualenvOperator`` and ``PythonPreexistingVirtualenvOperator``
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.

Review Comment:
   ```suggestion
   ``PythonPreexistingVirtualenvOperator``.
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -20,10 +20,11 @@
 Best Practices
 ==============
 
-Creating a new DAG is a two-step process:
+Creating a new DAG is a three-step process:
 
 - writing Python code to create a DAG object,
-- testing if the code meets our expectations
+- testing if the code meets our expectations,
+- running the DAG in production

Review Comment:
   Something about "running in prod" feels strange to me (does your dag ever have to be in prod to be completely created?)
   
   Maybe generalize this to `- configuring environment dependencies to run your DAG`



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those

Review Comment:
   ```suggestion
   you might get to the point where the dependencies required by the custom code of yours are conflicting with those
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,

Review Comment:
   ```suggestion
   your operators are written using custom python code, or when you want to write your own Custom Operator,
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the

Review Comment:
   ```suggestion
     As a DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installation those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. This usually means that you
+  cannot change it on the flight, adding new or changing requirements require at least airflow re-deployment

Review Comment:
   ```suggestion
     cannot change it on the fly, adding new or changing requirements require at least an Airflow re-deployment
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator

Review Comment:
   ```suggestion
   Using PythonPreexistingVirtualenvOperator
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need

Review Comment:
   ```suggestion
   have its own independent Python virtualenv and can specify fine-grained set of requirements that need
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only knowledge of Python, requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or even better - decorating your callable with

Review Comment:
   ```suggestion
   :class:`airflow.operators.python.PythonPreexistingVirtualenvOperator``, or even better - decorating your callable with
   ```
   I'm starting to question whether I have it wrong now :smile: 



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the modern

Review Comment:
   ```suggestion
   create a virtualenv that your Python callable function will execute in. In the modern
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1224366344

   > Looked through a bit more of the code this time around, just a few minor nits
   
   Thanks! I thought,  I am perfectionist :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] uranusjr commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
uranusjr commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r948601230


##########
airflow/decorators/__init__.pyi:
##########
@@ -124,6 +126,40 @@ class TaskDecoratorCollection:
     @overload
     def virtualenv(self, python_callable: Callable[FParams, FReturn]) -> Task[FParams, FReturn]: ...
     @overload
+    def external_python(
+        self,
+        *,
+        python_fspath: str = None,

Review Comment:
   I wonder if we should just call this argument `python`. Also this probably should not have `= None`.



##########
airflow/decorators/__init__.pyi:
##########
@@ -124,6 +126,40 @@ class TaskDecoratorCollection:
     @overload
     def virtualenv(self, python_callable: Callable[FParams, FReturn]) -> Task[FParams, FReturn]: ...
     @overload
+    def external_python(
+        self,
+        *,
+        python_fspath: str = None,
+        multiple_outputs: Optional[bool] = None,
+        # 'python_callable', 'op_args' and 'op_kwargs' since they are filled by
+        # _PythonVirtualenvDecoratedOperator.
+        use_dill: bool = False,
+        templates_dict: Optional[Mapping[str, Any]] = None,
+        show_return_value_in_logs: bool = True,
+        **kwargs,
+    ) -> TaskDecorator:
+        """Create a decorator to convert the decorated callable to a virtual environment task.
+
+        :param python_fspath: Full time path string (file-system specific) that points to a Python binary inside
+            a virtualenv that should be used (in ``VENV/bin`` folder). Should be absolute path
+            (so usually start with "/" or "X:/" depending on the filesystem/os used).
+        :param multiple_outputs: If set, function return value will be unrolled to multiple XCom values.
+            Dict will unroll to XCom values with keys as XCom keys. Defaults to False.
+        :param use_dill: Whether to use dill to serialize
+            the args and result (pickle is default). This allow more complex types
+            but requires you to include dill in your requirements.
+        :param templates_dict: a dictionary where the values are templates that
+            will get templated by the Airflow engine sometime between
+            ``__init__`` and ``execute`` takes place and are made available
+            in your callable's context after the template has been applied.
+        :param show_return_value_in_logs: a bool value whether to show return_value
+            logs. Defaults to True, which allows return value log output.
+            It can be set to False to prevent log output of return value when you return huge data
+            such as transmission a large amount of XCom to TaskAPI.
+        """
+    @overload
+    def external_python(self, python_callable: Callable[FParams, FReturn]) -> Task[FParams, FReturn]: ...

Review Comment:
   And this can be removed if the Python executable argument can’t be None.



##########
airflow/decorators/base.py:
##########
@@ -165,7 +166,7 @@ def __init__(
         python_callable: Callable,
         task_id: str,
         op_args: Optional[Collection[Any]] = None,
-        op_kwargs: Optional[Mapping[str, Any]] = None,
+        op_kwargs: Optional[MutableMapping[str, Any]] = None,

Review Comment:
   I don’t think this needs to be mutable?



##########
airflow/operators/python.py:
##########
@@ -176,7 +178,7 @@ def execute(self, context: Context) -> Any:
 
         return return_value
 
-    def determine_kwargs(self, context: Mapping[str, Any]) -> Mapping[str, Any]:
+    def determine_kwargs(self, context: MutableMapping[str, Any]) -> MutableMapping[str, Any]:

Review Comment:
   This does not need to change either.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r948537177


##########
airflow/operators/python.py:
##########
@@ -281,7 +283,143 @@ def execute(self, context: Context) -> Any:
         self.log.info("Done.")
 
 
-class PythonVirtualenvOperator(PythonOperator):
+class _BasePythonVirtualenvOperator(PythonOperator):
+    BASE_SERIALIZABLE_CONTEXT_KEYS = {
+        'ds',
+        'ds_nodash',
+        'inlets',
+        'next_ds',
+        'next_ds_nodash',
+        'outlets',
+        'prev_ds',
+        'prev_ds_nodash',
+        'run_id',
+        'task_instance_key_str',
+        'test_mode',
+        'tomorrow_ds',
+        'tomorrow_ds_nodash',
+        'ts',
+        'ts_nodash',
+        'ts_nodash_with_tz',
+        'yesterday_ds',
+        'yesterday_ds_nodash',
+    }
+    PENDULUM_SERIALIZABLE_CONTEXT_KEYS = {
+        'data_interval_end',
+        'data_interval_start',
+        'execution_date',
+        'logical_date',
+        'next_execution_date',
+        'prev_data_interval_end_success',
+        'prev_data_interval_start_success',
+        'prev_execution_date',
+        'prev_execution_date_success',
+        'prev_start_date_success',
+    }
+    AIRFLOW_SERIALIZABLE_CONTEXT_KEYS = {'macros', 'conf', 'dag', 'dag_run', 'task', 'params'}
+
+    def __init__(
+        self,
+        *,
+        python_callable: Callable,
+        use_dill: bool = False,
+        op_args: Optional[Collection[Any]] = None,
+        op_kwargs: Optional[MutableMapping[str, Any]] = None,
+        string_args: Optional[Iterable[str]] = None,
+        templates_dict: Optional[Dict] = None,
+        templates_exts: Optional[List[str]] = None,
+        **kwargs,
+    ):
+        if (
+            not isinstance(python_callable, types.FunctionType)
+            or isinstance(python_callable, types.LambdaType)
+            and python_callable.__name__ == "<lambda>"
+        ):
+            raise AirflowException('PythonVirtualenvOperator only supports functions for python_callable arg')
+        super().__init__(
+            python_callable=python_callable,
+            op_args=op_args,
+            op_kwargs=op_kwargs,
+            templates_dict=templates_dict,
+            templates_exts=templates_exts,
+            **kwargs,
+        )
+        self.string_args = string_args or []
+        self.use_dill = use_dill
+        self.pickling_library = dill if self.use_dill else pickle
+
+    def execute(self, context: Context) -> Any:
+        serializable_keys = set(self._iter_serializable_context_keys())
+        serializable_context = context_copy_partial(context, serializable_keys)
+        return super().execute(context=serializable_context)
+
+    def get_python_source(self):
+        """
+        Returns the source of self.python_callable
+        @return:
+        """
+        return dedent(inspect.getsource(self.python_callable))
+
+    def _write_args(self, file: Path):
+        if self.op_args or self.op_kwargs:
+            file.write_bytes(self.pickling_library.dumps({'args': self.op_args, 'kwargs': self.op_kwargs}))
+
+    def _iter_serializable_context_keys(self):
+        yield from self.BASE_SERIALIZABLE_CONTEXT_KEYS
+
+    def _write_string_args(self, file: Path):
+        file.write_text('\n'.join(map(str, self.string_args)))
+
+    def _read_result(self, path: Path):
+        if path.stat().st_size == 0:
+            return None
+        try:
+            return self.pickling_library.loads(path.read_bytes())
+        except ValueError:
+            self.log.error(
+                "Error deserializing result. Note that result deserialization "
+                "is not supported across major Python versions."
+            )
+            raise
+
+    def __deepcopy__(self, memo):
+        # module objects can't be copied _at all__
+        memo[id(self.pickling_library)] = self.pickling_library
+        return super().__deepcopy__(memo)
+
+    def _execute_python_callable_in_subprocess(self, python_path: Path, tmp_dir: Path):
+        if self.templates_dict:
+            self.op_kwargs['templates_dict'] = self.templates_dict
+        input_path = tmp_dir / 'script.in'

Review Comment:
   I  converted to Pathlib 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] ashb commented on a diff in pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
ashb commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r955890728


##########
airflow/decorators/__init__.pyi:
##########
@@ -41,6 +42,7 @@ __all__ = [
     "task_group",
     "python_task",
     "virtualenv_task",
+    "preexisting_virtualenv_task",

Review Comment:
   Minor English language nit: while "pre-existing virtual env" is perfectly understandable, the "pre" prefix is unnecessary, "existing_virtualenv_task" has the same meaning and is slightly shorter. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] ashb commented on a diff in pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
ashb commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r955910133


##########
tests/utils/test_preexisting_python_virtualenv_decorator.py:
##########
@@ -0,0 +1,52 @@
+#

Review Comment:
   Shouldn't this file be `tests/utils/test_decorators.py` to match the module name it's testing?



##########
tests/utils/test_preexisting_python_virtualenv_decorator.py:
##########
@@ -0,0 +1,52 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+import unittest
+
+from airflow.utils.decorators import remove_task_decorator
+
+
+class TestPreexistingPythonVirtualenvDecorator(unittest.TestCase):
+    def test_remove_task_decorator(self):
+        py_source = "@task.preexisting_virtualenv(use_dill=True)\ndef f():\nimport funcsigs"
+        res = remove_task_decorator(
+            python_source=py_source, task_decorator_name="@task.preexisting_virtualenv"
+        )
+        assert res == "def f():\nimport funcsigs"
+
+    def test_remove_decorator_no_parens(self):
+
+        py_source = "@task.preexisting_virtualenv\ndef f():\nimport funcsigs"
+        res = remove_task_decorator(
+            python_source=py_source, task_decorator_name="@task.preexisting_virtualenv"
+        )
+        assert res == "def f():\nimport funcsigs"
+
+    def test_remove_decorator_nested(self):
+
+        py_source = "@foo\n@task.preexisting_virtualenv\n@bar\ndef f():\nimport funcsigs"
+        res = remove_task_decorator(
+            python_source=py_source, task_decorator_name="@task.preexisting_virtualenv"
+        )
+        assert res == "@foo\n@bar\ndef f():\nimport funcsigs"
+
+        py_source = "@foo\n@task.preexisting_virtualenv()\n@bar\ndef f():\nimport funcsigs"
+        res = remove_task_decorator(
+            python_source=py_source, task_decorator_name="@task.preexisting_virtualenv"
+        )
+        assert res == "@foo\n@bar\ndef f():\nimport funcsigs"

Review Comment:
   No need for a class here, can just use bare test functions with pytest
   
   ```suggestion
   from airflow.utils.decorators import remove_task_decorator
   
   
   def test_remove_task_decorator(self):
       py_source = "@task.preexisting_virtualenv(use_dill=True)\ndef f():\nimport funcsigs"
       res = remove_task_decorator(
           python_source=py_source, task_decorator_name="@task.preexisting_virtualenv"
       )
       assert res == "def f():\nimport funcsigs"
   
   def test_remove_decorator_no_parens(self):
   
       py_source = "@task.preexisting_virtualenv\ndef f():\nimport funcsigs"
       res = remove_task_decorator(
           python_source=py_source, task_decorator_name="@task.preexisting_virtualenv"
       )
       assert res == "def f():\nimport funcsigs"
   
   def test_remove_decorator_nested(self):
   
       py_source = "@foo\n@task.preexisting_virtualenv\n@bar\ndef f():\nimport funcsigs"
       res = remove_task_decorator(
           python_source=py_source, task_decorator_name="@task.preexisting_virtualenv"
       )
       assert res == "@foo\n@bar\ndef f():\nimport funcsigs"
   
       py_source = "@foo\n@task.preexisting_virtualenv()\n@bar\ndef f():\nimport funcsigs"
       res = remove_task_decorator(
           python_source=py_source, task_decorator_name="@task.preexisting_virtualenv"
       )
       assert res == "@foo\n@bar\ndef f():\nimport funcsigs"
   
   ```
   
   (Or if you don't like that just remove the `unittest.Testcase` base class -- that isn't needed and should be avoided.)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r956393472


##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and their constraints do not limit you so the chance of a conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where the dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the venvs for you.
+  As a DAG Author, you only have to have virtualenv dependency installed and you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only knowledge of Python requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by this operator:
+
+* Your python callable has to be serializable. There are a number of python objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in the Airflow environment must be locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those libraries.
+* The virtual environments are run in the same operating system, so they cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for example when your repo is not available
+  or when there is a networking issue with reaching the repository)
+* It's easy to  fall into a "too" dynamic environment - since the dependencies you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might end up with the situation where
+  your task will stop working because someone released a new version of a dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might become malicious
+* The tasks are only isolated from each other via running in different environments. This makes it possible
+  that running tasks will still interfere with each other - for example subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PythonPreexistingVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability problems is to use the
+:class:`airflow.operators.python.PythonPreexistingVirtualenvOperator``, or even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the virtualenv you use is immutable.
+The ``immutable`` in this context means that (unlike in :class:`airflow.operators.python.PythonVirtualenvOperator`)
+you cannot add new dependencies to such pre-existing virtualenv. All dependencies you need should be added
+upfront in your environment (and available in all the workers in case your Airflow runs in a distributed
+environment). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger installations those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you start running a task.
+* You can run tasks with different sets of dependencies on the same workers - thus all resources are reused.

Review Comment:
   Same mental shortcut. If we do not have multiple envs inside same worker, if we want to have two envs  - we have to run two parallel workers (each worker assigned to  a different queue) running. Reworded.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] raphaelauv commented on pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
raphaelauv commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1234187770

   Yeah it will work , I'm just concerned about "encouraging" users to create `just one single image with multiple predefined envs` . Sometimes users create chaotic stack design because something work out of box :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r963015658


##########
airflow/operators/python.py:
##########
@@ -501,27 +555,152 @@ def _iter_serializable_context_keys(self):
         elif 'pendulum' in self.requirements:
             yield from self.PENDULUM_SERIALIZABLE_CONTEXT_KEYS
 
-    def _write_string_args(self, filename):
-        with open(filename, 'w') as file:
-            file.write('\n'.join(map(str, self.string_args)))
 
-    def _read_result(self, filename):
-        if os.stat(filename).st_size == 0:
-            return None
-        with open(filename, 'rb') as file:
-            try:
-                return self.pickling_library.load(file)
-            except ValueError:
-                self.log.error(
-                    "Error deserializing result. Note that result deserialization "
-                    "is not supported across major Python versions."
+class ExternalPythonOperator(_BasePythonVirtualenvOperator):
+    """
+    Allows one to run a function in a virtualenv that is not re-created but used as is
+    without the overhead of creating the virtualenv (with certain caveats).
+
+    The function must be defined using def, and not be
+    part of a class. All imports must happen inside the function
+    and no variables outside the scope may be referenced. A global scope
+    variable named virtualenv_string_args will be available (populated by
+    string_args). In addition, one can pass stuff through op_args and op_kwargs, and one
+    can use a return value.
+    Note that if your virtualenv runs in a different Python major version than Airflow,
+    you cannot use return values, op_args, op_kwargs, or use any macros that are being provided to
+    Airflow through plugins. You can use string_args though.
+
+    .. seealso::
+        For more information on how to use this operator, take a look at the guide:
+        :ref:`howto/operator:ExternalPythonOperator`
+
+    :param python: Full path string (file-system specific) that points to a Python binary inside
+        a virtualenv that should be used (in ``VENV/bin`` folder). Should be absolute path
+        (so usually start with "/" or "X:/" depending on the filesystem/os used).
+    :param python_callable: A python function with no references to outside variables,
+        defined with def, which will be run in a virtualenv
+    :param use_dill: Whether to use dill to serialize
+        the args and result (pickle is default). This allow more complex types
+        but if dill is not preinstalled in your venv, the task will fail with use_dill enabled.
+    :param op_args: A list of positional arguments to pass to python_callable.
+    :param op_kwargs: A dict of keyword arguments to pass to python_callable.
+    :param string_args: Strings that are present in the global var virtualenv_string_args,
+        available to python_callable at runtime as a list[str]. Note that args are split
+        by newline.

Review Comment:
   They are still useful (and necessary) when you want to use different Python version than the one Airflow is run on. I belive even piclkled plain string will not deserilize nicely if you go down with Python versions (this is what I believe the original reasoning for having string args was and it stlll holds for ExtrenalPython operator. I'd leave it - even if it might be a bit confusing, it migh save some headaches. Also I think one of the benefits of having PythonExternalOperator is "productionizing" of PythonVirtualenvOperator. The latter can help the users to iterate and develop , where the former can allow to use the same tasks without the venv creation overhed in production. So having 1-1 feature parity between those two is important to be able to seamlessly switch between them.



##########
airflow/operators/python.py:
##########
@@ -501,27 +555,152 @@ def _iter_serializable_context_keys(self):
         elif 'pendulum' in self.requirements:
             yield from self.PENDULUM_SERIALIZABLE_CONTEXT_KEYS
 
-    def _write_string_args(self, filename):
-        with open(filename, 'w') as file:
-            file.write('\n'.join(map(str, self.string_args)))
 
-    def _read_result(self, filename):
-        if os.stat(filename).st_size == 0:
-            return None
-        with open(filename, 'rb') as file:
-            try:
-                return self.pickling_library.load(file)
-            except ValueError:
-                self.log.error(
-                    "Error deserializing result. Note that result deserialization "
-                    "is not supported across major Python versions."
+class ExternalPythonOperator(_BasePythonVirtualenvOperator):
+    """
+    Allows one to run a function in a virtualenv that is not re-created but used as is
+    without the overhead of creating the virtualenv (with certain caveats).
+
+    The function must be defined using def, and not be
+    part of a class. All imports must happen inside the function
+    and no variables outside the scope may be referenced. A global scope
+    variable named virtualenv_string_args will be available (populated by
+    string_args). In addition, one can pass stuff through op_args and op_kwargs, and one
+    can use a return value.
+    Note that if your virtualenv runs in a different Python major version than Airflow,
+    you cannot use return values, op_args, op_kwargs, or use any macros that are being provided to
+    Airflow through plugins. You can use string_args though.
+
+    .. seealso::
+        For more information on how to use this operator, take a look at the guide:
+        :ref:`howto/operator:ExternalPythonOperator`
+
+    :param python: Full path string (file-system specific) that points to a Python binary inside
+        a virtualenv that should be used (in ``VENV/bin`` folder). Should be absolute path
+        (so usually start with "/" or "X:/" depending on the filesystem/os used).
+    :param python_callable: A python function with no references to outside variables,
+        defined with def, which will be run in a virtualenv
+    :param use_dill: Whether to use dill to serialize
+        the args and result (pickle is default). This allow more complex types
+        but if dill is not preinstalled in your venv, the task will fail with use_dill enabled.
+    :param op_args: A list of positional arguments to pass to python_callable.
+    :param op_kwargs: A dict of keyword arguments to pass to python_callable.
+    :param string_args: Strings that are present in the global var virtualenv_string_args,
+        available to python_callable at runtime as a list[str]. Note that args are split
+        by newline.

Review Comment:
   They are still useful (and necessary as I understand) when you want to use different Python version than the one Airflow is run on. I belive even piclkled plain string will not deserilize nicely if you go down with Python versions (this is what I believe the original reasoning for having string args was and it stlll holds for ExtrenalPython operator. I'd leave it - even if it might be a bit confusing, it migh save some headaches. Also I think one of the benefits of having PythonExternalOperator is "productionizing" of PythonVirtualenvOperator. The latter can help the users to iterate and develop , where the former can allow to use the same tasks without the venv creation overhed in production. So having 1-1 feature parity between those two is important to be able to seamlessly switch between them.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] o-nikolas commented on a diff in pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
o-nikolas commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r952871169


##########
airflow/operators/python.py:
##########
@@ -501,27 +561,150 @@ def _iter_serializable_context_keys(self):
         elif 'pendulum' in self.requirements:
             yield from self.PENDULUM_SERIALIZABLE_CONTEXT_KEYS
 
-    def _write_string_args(self, filename):
-        with open(filename, 'w') as file:
-            file.write('\n'.join(map(str, self.string_args)))
 
-    def _read_result(self, filename):
-        if os.stat(filename).st_size == 0:
-            return None
-        with open(filename, 'rb') as file:
-            try:
-                return self.pickling_library.load(file)
-            except ValueError:
-                self.log.error(
-                    "Error deserializing result. Note that result deserialization "
-                    "is not supported across major Python versions."
+class PythonPreexistingVirtualenvOperator(_BasePythonVirtualenvOperator):
+    """
+    Allows one to run a function in a virtualenv that is not re-created but used as is
+    without the overhead necessary overhead to create the virtualenv (with certain caveats).
+
+    The function must be defined using def, and not be
+    part of a class. All imports must happen inside the function
+    and no variables outside the scope may be referenced. A global scope
+    variable named virtualenv_string_args will be available (populated by
+    string_args). In addition, one can pass stuff through op_args and op_kwargs, and one
+    can use a return value.
+    Note that if your virtualenv runs in a different Python major version than Airflow,
+    you cannot use return values, op_args, op_kwargs, or use any macros that are being provided to
+    Airflow through plugins. You can use string_args though.
+
+    .. seealso::
+        For more information on how to use this operator, take a look at the guide:
+        :ref:`howto/operator:PythonPreexistingVirtualenvOperator`
+
+    :param python: Full path string (file-system specific) that points to a Python binary inside
+        a virtualenv that should be used (in ``VENV/bin`` folder). Should be absolute path
+        (so usually start with "/" or "X:/" depending on the filesystem/os used).
+    :param python_callable: A python function with no references to outside variables,
+        defined with def, which will be run in a virtualenv
+    :param use_dill: Whether to use dill to serialize
+        the args and result (pickle is default). This allow more complex types
+        but if dill is not preinstalled in your venv, the task will fail with use_dill enabled.
+    :param op_args: A list of positional arguments to pass to python_callable.
+    :param op_kwargs: A dict of keyword arguments to pass to python_callable.
+    :param string_args: Strings that are present in the global var virtualenv_string_args,
+        available to python_callable at runtime as a list[str]. Note that args are split
+        by newline.
+    :param templates_dict: a dictionary where the values are templates that
+        will get templated by the Airflow engine sometime between
+        ``__init__`` and ``execute`` takes place and are made available
+        in your callable's context after the template has been applied
+    :param templates_exts: a list of file extensions to resolve while
+        processing templated fields, for examples ``['.sql', '.hql']``
+    """
+
+    template_fields: Sequence[str] = tuple({'python_path'} | set(PythonOperator.template_fields))
+
+    def __init__(
+        self,
+        *,
+        python: str,
+        python_callable: Callable,
+        use_dill: bool = False,
+        op_args: Optional[Collection[Any]] = None,
+        op_kwargs: Optional[Mapping[str, Any]] = None,
+        string_args: Optional[Iterable[str]] = None,
+        templates_dict: Optional[Dict] = None,
+        templates_exts: Optional[List[str]] = None,
+        **kwargs,
+    ):
+        if not python:
+            raise ValueError("Python Path must be defined in PythonPreexistingVirtualenvOperator")
+        self.python_path = Path(python)
+        if not self.python_path.exists():
+            raise ValueError(f"Python Path '{self.python_path}' must exists")
+        if not self.python_path.is_file():
+            raise ValueError(f"Python Path '{self.python_path}' must be a file")
+        if not self.python_path.is_absolute():
+            raise ValueError(f"Python Path '{self.python_path}' must be an absolute path.")
+        super().__init__(
+            python_callable=python_callable,
+            use_dill=use_dill,
+            op_args=op_args,
+            op_kwargs=op_kwargs,
+            string_args=string_args,
+            templates_dict=templates_dict,
+            templates_exts=templates_exts,
+            **kwargs,
+        )
+
+    def execute_callable(self):
+        python_version_as_list_of_strings = self._get_python_version_from_venv()
+        if (
+            python_version_as_list_of_strings
+            and str(python_version_as_list_of_strings[0]) != str(sys.version_info.major)
+            and (self.op_args or self.op_kwargs)
+        ):
+            raise AirflowException(
+                "Passing op_args or op_kwargs is not supported across different Python "
+                "major versions for PythonVirtualenvOperator. Please use string_args."

Review Comment:
   ```suggestion
                   "major versions for PythonPreexistingVirtualenvOperator. Please use string_args."
   ```



##########
airflow/operators/python.py:
##########
@@ -501,27 +561,150 @@ def _iter_serializable_context_keys(self):
         elif 'pendulum' in self.requirements:
             yield from self.PENDULUM_SERIALIZABLE_CONTEXT_KEYS
 
-    def _write_string_args(self, filename):
-        with open(filename, 'w') as file:
-            file.write('\n'.join(map(str, self.string_args)))
 
-    def _read_result(self, filename):
-        if os.stat(filename).st_size == 0:
-            return None
-        with open(filename, 'rb') as file:
-            try:
-                return self.pickling_library.load(file)
-            except ValueError:
-                self.log.error(
-                    "Error deserializing result. Note that result deserialization "
-                    "is not supported across major Python versions."
+class PythonPreexistingVirtualenvOperator(_BasePythonVirtualenvOperator):
+    """
+    Allows one to run a function in a virtualenv that is not re-created but used as is
+    without the overhead necessary overhead to create the virtualenv (with certain caveats).

Review Comment:
   ```suggestion
       without the overhead of creating the virtualenv (with certain caveats).
   ```



##########
airflow/operators/python.py:
##########
@@ -501,27 +561,150 @@ def _iter_serializable_context_keys(self):
         elif 'pendulum' in self.requirements:
             yield from self.PENDULUM_SERIALIZABLE_CONTEXT_KEYS
 
-    def _write_string_args(self, filename):
-        with open(filename, 'w') as file:
-            file.write('\n'.join(map(str, self.string_args)))
 
-    def _read_result(self, filename):
-        if os.stat(filename).st_size == 0:
-            return None
-        with open(filename, 'rb') as file:
-            try:
-                return self.pickling_library.load(file)
-            except ValueError:
-                self.log.error(
-                    "Error deserializing result. Note that result deserialization "
-                    "is not supported across major Python versions."
+class PythonPreexistingVirtualenvOperator(_BasePythonVirtualenvOperator):
+    """
+    Allows one to run a function in a virtualenv that is not re-created but used as is
+    without the overhead necessary overhead to create the virtualenv (with certain caveats).
+
+    The function must be defined using def, and not be
+    part of a class. All imports must happen inside the function
+    and no variables outside the scope may be referenced. A global scope
+    variable named virtualenv_string_args will be available (populated by
+    string_args). In addition, one can pass stuff through op_args and op_kwargs, and one
+    can use a return value.
+    Note that if your virtualenv runs in a different Python major version than Airflow,
+    you cannot use return values, op_args, op_kwargs, or use any macros that are being provided to
+    Airflow through plugins. You can use string_args though.
+
+    .. seealso::
+        For more information on how to use this operator, take a look at the guide:
+        :ref:`howto/operator:PythonPreexistingVirtualenvOperator`
+
+    :param python: Full path string (file-system specific) that points to a Python binary inside
+        a virtualenv that should be used (in ``VENV/bin`` folder). Should be absolute path
+        (so usually start with "/" or "X:/" depending on the filesystem/os used).
+    :param python_callable: A python function with no references to outside variables,
+        defined with def, which will be run in a virtualenv
+    :param use_dill: Whether to use dill to serialize
+        the args and result (pickle is default). This allow more complex types
+        but if dill is not preinstalled in your venv, the task will fail with use_dill enabled.
+    :param op_args: A list of positional arguments to pass to python_callable.
+    :param op_kwargs: A dict of keyword arguments to pass to python_callable.
+    :param string_args: Strings that are present in the global var virtualenv_string_args,
+        available to python_callable at runtime as a list[str]. Note that args are split
+        by newline.
+    :param templates_dict: a dictionary where the values are templates that
+        will get templated by the Airflow engine sometime between
+        ``__init__`` and ``execute`` takes place and are made available
+        in your callable's context after the template has been applied
+    :param templates_exts: a list of file extensions to resolve while
+        processing templated fields, for examples ``['.sql', '.hql']``
+    """
+
+    template_fields: Sequence[str] = tuple({'python_path'} | set(PythonOperator.template_fields))
+
+    def __init__(
+        self,
+        *,
+        python: str,
+        python_callable: Callable,
+        use_dill: bool = False,
+        op_args: Optional[Collection[Any]] = None,
+        op_kwargs: Optional[Mapping[str, Any]] = None,
+        string_args: Optional[Iterable[str]] = None,
+        templates_dict: Optional[Dict] = None,
+        templates_exts: Optional[List[str]] = None,
+        **kwargs,
+    ):
+        if not python:
+            raise ValueError("Python Path must be defined in PythonPreexistingVirtualenvOperator")
+        self.python_path = Path(python)
+        if not self.python_path.exists():
+            raise ValueError(f"Python Path '{self.python_path}' must exists")
+        if not self.python_path.is_file():
+            raise ValueError(f"Python Path '{self.python_path}' must be a file")
+        if not self.python_path.is_absolute():
+            raise ValueError(f"Python Path '{self.python_path}' must be an absolute path.")
+        super().__init__(
+            python_callable=python_callable,
+            use_dill=use_dill,
+            op_args=op_args,
+            op_kwargs=op_kwargs,
+            string_args=string_args,
+            templates_dict=templates_dict,
+            templates_exts=templates_exts,
+            **kwargs,
+        )
+
+    def execute_callable(self):
+        python_version_as_list_of_strings = self._get_python_version_from_venv()
+        if (
+            python_version_as_list_of_strings
+            and str(python_version_as_list_of_strings[0]) != str(sys.version_info.major)
+            and (self.op_args or self.op_kwargs)
+        ):
+            raise AirflowException(
+                "Passing op_args or op_kwargs is not supported across different Python "
+                "major versions for PythonVirtualenvOperator. Please use string_args."
+                f"Sys version: {sys.version_info}. Venv version: {python_version_as_list_of_strings}"
+            )
+        with TemporaryDirectory(prefix='tmd') as tmp_dir:
+            tmp_path = Path(tmp_dir)
+            return self._execute_python_callable_in_subprocess(self.python_path, tmp_path)
+
+    def _get_virtualenv_path(self) -> Path:
+        return self.python_path.parents[1]
+
+    def _get_python_version_from_venv(self) -> List[str]:
+        try:
+            result = subprocess.check_output([self.python_path, "--version"], text=True)
+            return result.strip().split(" ")[-1].split(".")
+        except Exception as e:
+            raise ValueError(f"Error while executing {self.python_path}: {e}")
+
+    def _get_airflow_version_from_venv(self) -> Optional[str]:
+        try:
+            result = subprocess.check_output(
+                [self.python_path, "-c", "from airflow import version; print(version.version)"], text=True
+            )
+            venv_airflow_version = result.strip()
+            if venv_airflow_version != airflow_version:
+                raise AirflowConfigException(
+                    f"The version of airflow installed in virtualenv {self._get_virtualenv_path()} is "
+                    f"different than runtime Airflow error: {airflow_version}. Make sure your venv"
+                    f" has the same airflow version installed as Airflow runtime."

Review Comment:
   ```suggestion
                       f"The version of Airflow installed in the virtualenv {self._get_virtualenv_path()}: {venv_airflow_version} is "
                       f"different than the runtime Airflow version: {airflow_version}. Make sure your venv"
                       f" has the same Airflow version installed as the Airflow runtime."
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r948873289


##########
airflow/operators/python.py:
##########
@@ -176,7 +178,7 @@ def execute(self, context: Context) -> Any:
 
         return return_value
 
-    def determine_kwargs(self, context: Mapping[str, Any]) -> Mapping[str, Any]:
+    def determine_kwargs(self, context: MutableMapping[str, Any]) -> MutableMapping[str, Any]:

Review Comment:
   No. Actually the problem is that previously mypy has not detected the modifications we've done to op_kwargs.
   
   ```
           if self.templates_dict:
               op_kwargs['templates_dict'] = self.templates_dict
   ```
   
   is not possible for Mapping (Mapping is immutable). Seems it "accidentally" worked because our Mapping was really a Dict. But the right solution is to create a new mapping out of it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1228553448

   > I'm also not 100% sold on "PreexistingVirtualEnv", I will ponder.
   
   BTW. I think the "preexising" part is one that I also do not 100% like (too long and easy to make typo). But I think keeping the relation to Virtualenv operator and the fact that they are very  closely related is important. So I am open for any other name that keeps those properties :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r956385787


##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.  *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and their constraints do not limit you so the chance of a conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write your own Custom Operator,

Review Comment:
   I changed it to "pythonic" from modern



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] raphaelauv commented on pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
raphaelauv commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1234271552

   thanks for your answer , it's really clear :+1: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25780: Implement PythonPreexistingVirtualenvOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1221510678

   > The problem with `PythonPreexistingVirtualenvOperator` is it ideologically misleads the user to think it can only reference a virtual environment, while functionally it can work against much more (for example, you can set `python="/usr/bin/python"` to use system packages installed by apt, even if Airflow is not installed against that Python interpreter). This leads to worse discoverability.
   
   Why I like the PythonPreexistingVirtualenOperator name @uranusjr  ? I thought a lot about it and it was clear to me that it is better name at the moment I attempted to write the documentation.
   
   I think it is really the matter of what we want to promote, not what it CAN do and how it refers to the currently existing operator names that people are used to and are using already. I think "external Python" is far less discoverable if you consider what we want our users to do.
   
   First of all I think we want to promote virtualenv usage. As you well know, this is something also `pip` maintainer promote heavily - that pretty much any serious use of multiple python environments should be via virtualenv. This has been a hot topic in discussion I was participating at `pip` and it si very clear that using "a python binary" which is not part of virtualenv  is generally dangerous. It's fine if you want to build your "base" image. But we are talking about multiple environments in one environment (whether it's container image or local installation) virtualenv is the way to go. While it is ok to have "base" python installation as just system-installed Python, when it comes to multiple environments set in the same env - setting up virtualenv for those is pretty much, the only option. And definitely one that we should promote to the users. The fact that you CAN do something does not mean that you SHOULD. And in this case I think even if the users can use any non-virtua
 lenv Python binary, I think they should not. And the name is a strong indication they should not.
   
   But there is another aspect as well. And I realised that when I attempted to write the documentation. If you look at t the "Best Practices" document I updated you will - I hope - immediately see why the name is better. Airflow users are already used to Python, PythonVirtualenv, Docker, Kubernetes.  And the PythonPreexistingVirtualenvOperator is so much closer to PythonVirtualenv than PythonOperator. For example the callable needs to be serializable, depending on the installed packages different context variables are passed, you can use `dill` or `pickle`.  This operator is ALMOST the same from the behaviour as PythonVirtualenvOperator. The only difference is the startup/overhead and the way how you either specify the dependencies or make them embedded.
   
   The users have already a mental model built for how they can use airflow and it's better if we fit-in the exisitng mental model rather than try to crate a completely new one. 
   
   And if you look at the Best Practices docment - it feels super-natural. I wrote a chapter about managing dependencies, where the user can naturally progress from Python through PythonVirtualenv, to PythonPreexisitngVirtualenv. And if you see the line of thoughts there - it gradually changes the pros/cons. And it fits very well there. 
   
   I think it fits so well, that it actually finally makes PythonVirtualenvOperator actually useful. This is something that @jedcunningham mentioned - that previously that operator was not very useful, and I agree. However when we introduce the new operator, it suddenly changes: 
   
   I wrote a separate chapter about it - so you can read it in the PR but out-of-the sudden PythonVirtualenvOperator + PythonPreexistingVirtualenvOperator combo is a super useful pair of operators. One of the big drawbacks of the PythonPreexistingVirtualenvOperator. is that you have to have the venv prepared for you. And often as a data scientist or even data engineer, you will not be able to add your venv to the image or local installation because it is managed by others. But you really want to iterate on your tasks with new dependencies. The PythonPreexistingVirtualenvOperator makes it super hard. But on the other hand PythonVirtualenvOperator makes it super easy (and you do not need to ask anyone for permission). And in the new taskflow world, this will be just a matter of replacing your @task.preexisting_virtualenv with @task_virtualenv decorator (with new requirements) and you can test your new dependencies without changing a single line of code in your callable. This is super p
 owerful and it is I think the most important reason why it should be super-obvious from the names of those two operators that they are almost the same.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r950703457


##########
airflow/example_dags/example_python_operator.py:
##########
@@ -93,3 +93,28 @@ def callable_virtualenv():
 
         virtualenv_task = callable_virtualenv()
         # [END howto_operator_python_venv]
+
+        # [START howto_operator_external_python]
+        @task.external_python(task_id="virtualenv_python", python="/ven/bin/python")

Review Comment:
   Yeah. I planned to change it to make system tests executable anyway :) 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] ashb commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
ashb commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r950209910


##########
airflow/providers/docker/decorators/docker.py:
##########
@@ -27,7 +27,8 @@
 
 from airflow.decorators.base import DecoratedOperator, task_decorator_factory
 from airflow.providers.docker.operators.docker import DockerOperator
-from airflow.utils.python_virtualenv import remove_task_decorator, write_python_script
+from airflow.utils.decorators import remove_task_decorator

Review Comment:
   (I.e. so this provider will work with 2.3 and 2.3 we'll need to try importing the new way, catch the error then import old way)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1219386764

   I pushed fixes (I still need to add more tests).  
   
   Unfortunately it seems that the `mypy` instabilitites are a bit different nature. As explained in the comment above for some reason it started to complain with those errors:
   
   ```
   airflow/operators/python.py:277: error: Argument 3 to "skip" of "SkipMixin" has
   incompatible type "Collection[Union[BaseOperator, MappedOperator]]"; expected
   "Sequence[BaseOperator]"  [arg-type]
                       self.skip(dag_run, execution_date, downstream_tasks)
                                                          ^
   airflow/operators/python.py:282: error: Argument 3 to "skip" of "SkipMixin" has
   incompatible type "Iterable[DAGNode]"; expected "Sequence[BaseOperator]"
   [arg-type]
       ...              self.skip(dag_run, execution_date, context["task"].get_d...
                                                           ^
   Found 2 errors in 1 file (checked 1 source file)
   ```
   
   I looked at it closely and I think the suggestions from MyPy were actually correct. I could not find any reason why get_direct_relatives should return DAGNode, as far as I can tell you cannot get TaskGroups  - you only get tasks so `Union[BaseOperator, MappedOperator]' (and you cannot skip TaskGroup either). Also Collection was not right, because tasks[0] was used in the 'skip' method:
   
   ```
    DagRun.dag_id == tasks[0].dag_id,
   ```
   
   So looks like somethign "masked" the problems from MyPy before and we should fix it here.  Any insights and confirmation of my findings would be appreciated before I add more tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r948847654


##########
airflow/operators/python.py:
##########
@@ -176,7 +178,7 @@ def execute(self, context: Context) -> Any:
 
         return return_value
 
-    def determine_kwargs(self, context: Mapping[str, Any]) -> Mapping[str, Any]:
+    def determine_kwargs(self, context: MutableMapping[str, Any]) -> MutableMapping[str, Any]:

Review Comment:
   I think I know why MyPy was complaining. The problem was that if only .py file changed, Mypy did not really use .pyi file to determine the actual type - we have a number of cases where type in .pyi overrides the types and if it does not change and MyPy does not see the .pyi file it will not know the type is overridden in .pyi file when you run it incrementally. 
   
   I will add a small change to the mypy pre-commit to see if there are .pyi files corresponding to changed .py files and add them to the list of files if they are missing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r948536905


##########
airflow/operators/python.py:
##########
@@ -441,89 +531,122 @@ def execute_callable(self):
             with open(requirements_file_name, 'w') as file:
                 file.write(requirements_file_contents)
 
-            if self.templates_dict:
-                self.op_kwargs['templates_dict'] = self.templates_dict
-
-            input_filename = os.path.join(tmp_dir, 'script.in')
-            output_filename = os.path.join(tmp_dir, 'script.out')
-            string_args_filename = os.path.join(tmp_dir, 'string_args.txt')
-            script_filename = os.path.join(tmp_dir, 'script.py')
-
             prepare_virtualenv(
                 venv_directory=tmp_dir,
                 python_bin=f'python{self.python_version}' if self.python_version else None,
                 system_site_packages=self.system_site_packages,
                 requirements_file_path=requirements_file_name,
                 pip_install_options=self.pip_install_options,
             )
+            python_path = tmp_path / "bin" / "python"
 
-            self._write_args(input_filename)
-            self._write_string_args(string_args_filename)
-            write_python_script(
-                jinja_context=dict(
-                    op_args=self.op_args,
-                    op_kwargs=self.op_kwargs,
-                    pickling_library=self.pickling_library.__name__,
-                    python_callable=self.python_callable.__name__,
-                    python_callable_source=self.get_python_source(),
-                ),
-                filename=script_filename,
-                render_template_as_native_obj=self.dag.render_template_as_native_obj,
-            )
-
-            execute_in_subprocess(
-                cmd=[
-                    f'{tmp_dir}/bin/python',
-                    script_filename,
-                    input_filename,
-                    output_filename,
-                    string_args_filename,
-                ]
-            )
-
-            return self._read_result(output_filename)
-
-    def get_python_source(self):
-        """
-        Returns the source of self.python_callable
-        @return:
-        """
-        return dedent(inspect.getsource(self.python_callable))
-
-    def _write_args(self, filename):
-        if self.op_args or self.op_kwargs:
-            with open(filename, 'wb') as file:
-                self.pickling_library.dump({'args': self.op_args, 'kwargs': self.op_kwargs}, file)
+            return self._execute_python_callable_in_subprocess(python_path, tmp_path)
 
     def _iter_serializable_context_keys(self):

Review Comment:
   I think I also need to add a bit smarter _iter_serializable_context_keys - but I will eed to check if installed venv contains airflow/pendulum respectively. (or we could make it a requirement). 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] ashb commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
ashb commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r962837415


##########
tests/decorators/test_external_python.py:
##########
@@ -0,0 +1,101 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+import datetime
+import sys
+from datetime import timedelta
+from subprocess import CalledProcessError
+
+import pytest
+
+from airflow.decorators import task
+from airflow.utils import timezone
+
+DEFAULT_DATE = timezone.datetime(2016, 1, 1)
+END_DATE = timezone.datetime(2016, 1, 2)
+INTERVAL = timedelta(hours=12)
+FROZEN_NOW = timezone.datetime(2016, 1, 2, 12, 1, 1)
+
+TI_CONTEXT_ENV_VARS = [
+    'AIRFLOW_CTX_DAG_ID',
+    'AIRFLOW_CTX_TASK_ID',
+    'AIRFLOW_CTX_EXECUTION_DATE',
+    'AIRFLOW_CTX_DAG_RUN_ID',
+]
+
+
+PYTHON_VERSION = sys.version_info[0]
+
+# Technically Not a separate virtualenv but should be good enough for unit tests
+PYTHON = sys.executable
+
+
+class TestExternalPythonDecorator:
+    def test_add_dill(self, dag_maker):
+        @task.external_python(python=PYTHON, use_dill=True)
+        def f():
+            """Ensure dill is correctly installed."""
+            import dill  # noqa: F401
+
+        with dag_maker():
+            ret = f()
+
+        ret.operator.run(start_date=DEFAULT_DATE, end_date=DEFAULT_DATE)
+
+    def test_fail(self, dag_maker):
+        @task.external_python(python=PYTHON)
+        def f():
+            raise Exception
+
+        with dag_maker():
+            ret = f()
+
+        with pytest.raises(CalledProcessError):
+            ret.operator.run(start_date=DEFAULT_DATE, end_date=DEFAULT_DATE)
+
+    def test_with_args(self, dag_maker):
+        @task.external_python(python=PYTHON)
+        def f(a, b, c=False, d=False):
+            if a == 0 and b == 1 and c and not d:
+                return True
+            else:
+                raise Exception
+
+        with dag_maker():
+            ret = f(0, 1, c=True)
+
+        ret.operator.run(start_date=DEFAULT_DATE, end_date=DEFAULT_DATE)
+
+    def test_return_none(self, dag_maker):
+        @task.external_python(python=PYTHON)
+        def f():
+            return None
+
+        with dag_maker():
+            ret = f()
+
+        ret.operator.run(start_date=DEFAULT_DATE, end_date=DEFAULT_DATE)
+
+    def test_nonimported_as_arg(self, dag_maker):

Review Comment:
   Don't think this case is needed either.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r963035629


##########
airflow/operators/python.py:
##########
@@ -501,27 +555,152 @@ def _iter_serializable_context_keys(self):
         elif 'pendulum' in self.requirements:
             yield from self.PENDULUM_SERIALIZABLE_CONTEXT_KEYS
 
-    def _write_string_args(self, filename):
-        with open(filename, 'w') as file:
-            file.write('\n'.join(map(str, self.string_args)))
 
-    def _read_result(self, filename):
-        if os.stat(filename).st_size == 0:
-            return None
-        with open(filename, 'rb') as file:
-            try:
-                return self.pickling_library.load(file)
-            except ValueError:
-                self.log.error(
-                    "Error deserializing result. Note that result deserialization "
-                    "is not supported across major Python versions."
+class ExternalPythonOperator(_BasePythonVirtualenvOperator):
+    """
+    Allows one to run a function in a virtualenv that is not re-created but used as is
+    without the overhead of creating the virtualenv (with certain caveats).
+
+    The function must be defined using def, and not be
+    part of a class. All imports must happen inside the function
+    and no variables outside the scope may be referenced. A global scope
+    variable named virtualenv_string_args will be available (populated by
+    string_args). In addition, one can pass stuff through op_args and op_kwargs, and one
+    can use a return value.
+    Note that if your virtualenv runs in a different Python major version than Airflow,
+    you cannot use return values, op_args, op_kwargs, or use any macros that are being provided to
+    Airflow through plugins. You can use string_args though.
+
+    .. seealso::
+        For more information on how to use this operator, take a look at the guide:
+        :ref:`howto/operator:ExternalPythonOperator`
+
+    :param python: Full path string (file-system specific) that points to a Python binary inside
+        a virtualenv that should be used (in ``VENV/bin`` folder). Should be absolute path
+        (so usually start with "/" or "X:/" depending on the filesystem/os used).
+    :param python_callable: A python function with no references to outside variables,
+        defined with def, which will be run in a virtualenv
+    :param use_dill: Whether to use dill to serialize
+        the args and result (pickle is default). This allow more complex types
+        but if dill is not preinstalled in your venv, the task will fail with use_dill enabled.
+    :param op_args: A list of positional arguments to pass to python_callable.
+    :param op_kwargs: A dict of keyword arguments to pass to python_callable.
+    :param string_args: Strings that are present in the global var virtualenv_string_args,
+        available to python_callable at runtime as a list[str]. Note that args are split
+        by newline.
+    :param templates_dict: a dictionary where the values are templates that
+        will get templated by the Airflow engine sometime between
+        ``__init__`` and ``execute`` takes place and are made available
+        in your callable's context after the template has been applied
+    :param templates_exts: a list of file extensions to resolve while
+        processing templated fields, for examples ``['.sql', '.hql']``
+    """
+
+    template_fields: Sequence[str] = tuple({'python_path'} | set(PythonOperator.template_fields))
+
+    def __init__(
+        self,
+        *,
+        python: str,
+        python_callable: Callable,
+        use_dill: bool = False,
+        op_args: Optional[Collection[Any]] = None,
+        op_kwargs: Optional[Mapping[str, Any]] = None,
+        string_args: Optional[Iterable[str]] = None,
+        templates_dict: Optional[Dict] = None,
+        templates_exts: Optional[List[str]] = None,
+        **kwargs,
+    ):
+        if not python:
+            raise ValueError("Python Path must be defined in ExternalPythonOperator")
+        self.python = python
+        super().__init__(
+            python_callable=python_callable,
+            use_dill=use_dill,
+            op_args=op_args,
+            op_kwargs=op_kwargs,
+            string_args=string_args,
+            templates_dict=templates_dict,
+            templates_exts=templates_exts,
+            **kwargs,
+        )
+
+    def execute_callable(self):
+        python_path = Path(self.python)
+        if not python_path.exists():
+            raise ValueError(f"Python Path '{python_path}' must exists")
+        if not python_path.is_file():
+            raise ValueError(f"Python Path '{python_path}' must be a file")
+        if not python_path.is_absolute():
+            raise ValueError(f"Python Path '{python_path}' must be an absolute path.")
+        python_version_as_list_of_strings = self._get_python_version_from_venv()
+        if (
+            python_version_as_list_of_strings
+            and str(python_version_as_list_of_strings[0]) != str(sys.version_info.major)
+            and (self.op_args or self.op_kwargs)
+        ):
+            raise AirflowException(
+                "Passing op_args or op_kwargs is not supported across different Python "
+                "major versions for ExternalPythonOperator. Please use string_args."
+                f"Sys version: {sys.version_info}. Venv version: {python_version_as_list_of_strings}"
+            )
+        with TemporaryDirectory(prefix='tmd') as tmp_dir:
+            tmp_path = Path(tmp_dir)
+            return self._execute_python_callable_in_subprocess(python_path, tmp_path)
+
+    def _get_virtualenv_path(self) -> Path:
+        return Path(self.python).parents[1]
+
+    def _get_python_version_from_venv(self) -> List[str]:
+        try:
+            result = subprocess.check_output([self.python, "--version"], text=True)
+            return result.strip().split(" ")[-1].split(".")
+        except Exception as e:
+            raise ValueError(f"Error while executing {self.python}: {e}")
+
+    def _get_airflow_version_from_venv(self) -> Optional[str]:
+        try:
+            result = subprocess.check_output(
+                [self.python, "-c", "from airflow import version; print(version.version)"], text=True
+            )
+            venv_airflow_version = result.strip()
+            if venv_airflow_version != airflow_version:
+                raise AirflowConfigException(
+                    f"The version of Airflow installed in the virtualenv {self._get_virtualenv_path()}: "
+                    f"{venv_airflow_version} is different than the runtime Airflow version: "
+                    f"{airflow_version}. Make sure your environment has the same Airflow version "
+                    f"installed as the Airflow runtime."
                 )
-                raise
+            return venv_airflow_version
+        except Exception as e:
+            self.log.info("When checking for Airflow installed in venv got %s", e)
+            self.log.info(
+                f"This means that Airflow is not properly installed in the virtualenv "
+                f"{self._get_virtualenv_path()}. Airflow context keys will not be available. "
+                f"Please Install Airflow {airflow_version} in your venv to access them."
+            )

Review Comment:
   I've added "expect_airflow" and "expect_pendulum" and raise those logs to warning levels if someone sets it to True, but skip the logs entirely otherwise. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #25780: Implement ExternalPythonOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r961984441


##########
airflow/example_dags/example_python_operator.py:
##########
@@ -32,6 +36,20 @@
 
 log = logging.getLogger(__name__)
 
+PYTHON = sys.executable
+
+BASE_DIR = tempfile.gettempdir()
+EXTERNAL_PYTHON_ENV = Path(BASE_DIR, "venv-for-system-tests")
+
+# [START howto_initial_operator_external_python]
+
+EXTERNAL_PYTHON_PATH = EXTERNAL_PYTHON_ENV / "bin" / "python"
+
+# [END howto_initial_operator_external_python]
+
+if not EXTERNAL_PYTHON_PATH.exists():
+    venv.create(EXTERNAL_PYTHON_ENV)

Review Comment:
   Actually - that was a very good call @mik-laj . During testing it, I found that the operator expected the venv during parsing - which was completely unnnecessary. I changed it slightly after your comment:
   
   * I am using `sys.executable` in the example as external Python. Not really the "venv" case but since we changed it to ExternalPythonOperator, it's actuallly a valid case now :D 
   
   * the 'check if venv exist is moved to "execute" - which will make it possible to not have the venv defined during parsing 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org