Posted to dev@airflow.apache.org by Abhishek Bhakat <ab...@astronomer.io.INVALID> on 2022/09/01 04:40:23 UTC

Re: [DISCUSS] "Use existing venv" support for PythonVirtualenvOperator as counterpart to AIP-46

Would like to vote for ExternalPythonOperator.
Because virtualenvs usually have symbolic links for the Python binaries,
unless --copies is used to make them fully portable.
Additionally, there is the option to use a differently compiled Python altogether
(for example PyPy <https://www.pypy.org/index.html> or Jython
<https://www.jython.org/>). Naming these "External Pythons" makes more
sense to me.
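
To make the symlink point concrete, here is a minimal sketch using only the
standard-library venv module. The helper names (venv_python, create_env) are
made up for this illustration, and the environments are created bare (no pip)
in a temporary directory. It only shows that the default mode links the
interpreter while copies mode produces a real file.

```python
import os
import tempfile
import venv


def venv_python(env_dir: str) -> str:
    """Return the interpreter path inside a venv (layout differs on Windows)."""
    if os.name == "nt":
        return os.path.join(env_dir, "Scripts", "python.exe")
    return os.path.join(env_dir, "bin", "python")


def create_env(env_dir: str, copies: bool) -> str:
    """Create a bare venv; copies=True mirrors `python -m venv --copies`."""
    venv.create(env_dir, symlinks=not copies, with_pip=False)
    return venv_python(env_dir)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        linked = create_env(os.path.join(tmp, "linked"), copies=False)
        copied = create_env(os.path.join(tmp, "copied"), copies=True)
        # On POSIX the first is typically a symlink, the second a regular file.
        print("default env interpreter is a symlink:", os.path.islink(linked))
        print("copies env interpreter is a symlink: ", os.path.islink(copied))
```

Either way, the operator would only need the path to that interpreter, which
is why "external Python" describes it regardless of how the venv was built.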

Thanks,
Abhishek

On 31-Aug-2022 at 9:30:42 PM, Ash Berlin-Taylor <as...@apache.org> wrote:

> Personally, of those two I greatly prefer ExternalPythonOperator. (I didn't
> vote for either of those)
>
> (Also I think PythonExternalEnvOperator would be the "correct" casing,
> Virtualenv is a thing in python, Externalenv isn't.)
>
> -ash
>
> On 31 August 2022 21:28:20 BST, Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>> We've got 56 votes (wow!)
>>
>> ExternalPythonOperator won. It got 41%, followed by
>> PythonExternalenvOperator with 30% and PythonRunenvOperator with 26%.
>>
>> I am fine with either of those. But - despite slightly lower support - I
>> think PythonExternalenvOperator reflects a bit better the resemblance to
>> PythonVirtualenvOperator that I think is important.
>>
>> Asking those who were very strong on ExternalPythonOperator - is
>> PythonExternalenvOperator "good enough" for you as well?
>>
>> The poll had only one option to choose from, but if that is an acceptable
>> option for those who favoured "ExternalPythonOperator" - I have personally
>> a slight preference for that one.
>>
>> J.
>>
>>
>>
>>
>> On Wed, Aug 31, 2022 at 3:10 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>>> Just 5 hours left to change the world!
>>>
>>> You can become one of the people who influenced the decision on naming
>>> the new operator :D
>>>
>>> https://twitter.com/jarekpotiuk/status/1563602012100767746
>>>
>>> (Right, maybe changing the world just a little, but still)
>>>
>>> J.
>>>
>>>
>>> On Sat, Aug 27, 2022 at 9:01 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>
>>>> Seems we are only now at the stage that we need to choose the best name
>>>> for the operator
>>>>
>>>> I started a name poll on Twitter :)
>>>>
>>>> https://twitter.com/jarekpotiuk/status/1563602012100767746
>>>>
>>>> PR here: https://github.com/apache/airflow/pull/25780
>>>>
>>>> J.
>>>>
>>>>
>>>>
>>>> On Thu, Aug 18, 2022 at 1:53 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>
>>>>> Draft PR - needs some more tests and review with typing changes - in
>>>>> https://github.com/apache/airflow/pull/25780
>>>>> Eventually PythonExternalOperator seems like a good name.
>>>>>
>>>>> J.
>>>>>
>>>>>
>>>>> On Wed, Aug 17, 2022 at 10:37 PM Jeambrun Pierre <
>>>>> pierrejbrun@gmail.com> wrote:
>>>>>
>>>>>> I also like the ability to use a specific interpreter.
>>>>>>
>>>>>> Maybe we could leave everything that is env related to the PVO (even
>>>>>> using an existing one) and let another one handle the interpreter.
>>>>>>
>>>>>> As Ash mentioned I also feel like an additional parameter
>>>>>> (python/interpreter etc.) to the PO would make sense and is quite intuitive
>>>>>> rather than a complete new operator, but it might be harder to implement.
>>>>>>
>>>>>> Best
>>>>>> Pierre Jeambrun
>>>>>>
>>>>>> On Wed, Aug 17, 2022 at 20:46, Collin McNulty
>>>>>> <co...@astronomer.io.invalid> wrote:
>>>>>>
>>>>>>> I concur that this would be very useful. I can see a common pattern
>>>>>>> being to have a task to create an environment if it does not already exist
>>>>>>> and then subsequent tasks use that environment.
>>>>>>>
>>>>>>> On Wed, Aug 17, 2022 at 12:30 PM Jarek Potiuk <ja...@potiuk.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Sounds like this is really in the middle between PVO and PO :).
>>>>>>>>
>>>>>>>> BTW. I spoke with a customer of mine today and they said they would
>>>>>>>> ABSOLUTELY love it. They were actually blocked from migrating to 2.3.3
>>>>>>>> because one of their teams needed a DBT environment while the other
>>>>>>>> team needed some other dependency, and the two conflict with each
>>>>>>>> other. They are using Nomad + Docker already, and while extending the
>>>>>>>> image with another venv is super-easy for them, they were considering
>>>>>>>> building several Docker images to serve their users. That is an order
>>>>>>>> of magnitude more complex problem for them, because they would have to
>>>>>>>> make a whole new pipeline to build and distribute multiple images and
>>>>>>>> implement a queue-based split between the teams, or switch to using
>>>>>>>> DockerOperator.
>>>>>>>>
>>>>>>>> This one will allow them to do a limited version of multi-tenancy for
>>>>>>>> their teams - without the actual separation but with even more
>>>>>>>> fine-grained separation of envs - because they would be able to use
>>>>>>>> different deps even for different tasks in the same DAG.
>>>>>>>>
>>>>>>>>
>>>>>>>> J,
>>>>>>>>
>>>>>>>> On Wed, Aug 17, 2022 at 6:21 PM Ash Berlin-Taylor <as...@apache.org>
>>>>>>>> wrote:
>>>>>>>> >
>>>>>>>> > Another option would be to change the PythonOperator/@task to
>>>>>>>> take a `python` argument (which also does change the behaviour of _that_
>>>>>>>> operator a lot with or without that argument if we did that.)
>>>>>>>> >
>>>>>>>> > On 17 August 2022 15:46:52 BST, Jarek Potiuk <ja...@potiuk.com>
>>>>>>>> wrote:
>>>>>>>> >>
>>>>>>>> >> Yeah. TP - I like that explicit separation. It's much cleaner. I
>>>>>>>> still
>>>>>>>> >> have to think about the name though. While I see where
>>>>>>>> >> ExternalPythonOperator comes from,  It sounds a bit less than
>>>>>>>> obvious.
>>>>>>>> >> I think the name should somehow contain "Environment" because
>>>>>>>> very few
>>>>>>>> >> people realise that running Python from a virtualenv actually
>>>>>>>> >> implicitly "activates" the venv.
>>>>>>>> >> I think maybe deprecating the old PythonVirtualenvOperator and
>>>>>>>> >> introducing two new operators: PythonInCreatedVirtualEnvOperator,
>>>>>>>> >> PythonInExistingVirtualEnvOperator ? Not exactly those names -
>>>>>>>> they
>>>>>>>> >> are too long - but something like that. Maybe we should get rid
>>>>>>>> of
>>>>>>>> >> Python in the name at all ?
>>>>>>>> >>
>>>>>>>> >> BTW. I think we should generally do more of the discussions here
>>>>>>>> and
>>>>>>>> >> express our thoughts about Airflow here. Even if there are no
>>>>>>>> answers
>>>>>>>> >> or interest immediately, I think that it makes sense to do a bit
>>>>>>>> of a
>>>>>>>> >> melting pot that sometimes might produce some cool (or rather
>>>>>>>> hot)
>>>>>>>> >> stuff as a result.
>>>>>>>> >>
>>>>>>>> >> On Wed, Aug 17, 2022 at 8:45 AM Tzu-ping Chung
>>>>>>>> <tp...@astronomer.io.invalid> wrote:
>>>>>>>> >>>
>>>>>>>> >>>
>>>>>>>> >>>  One thing I thought of (but never bothered to write about) is
>>>>>>>> to introduce a separate operator instead, say ExternalPythonOperator (bike
>>>>>>>> shedding on name is welcomed), that explicitly takes a path to the
>>>>>>>> interpreter (say in a virtual environment) and just use that to run the
>>>>>>>> code. This also enables users to create a virtual environment upfront, but
>>>>>>>> avoids needing to overload PythonVirtualenvOperator for the purpose. This
>>>>>>>> also opens an extra use case that you can use any Python installation to
>>>>>>>> run the code (say a custom-compiled interpreter), although nobody asked
>>>>>>>> about that.
>>>>>>>> >>>
>>>>>>>> >>>  TP
>>>>>>>> >>>
>>>>>>>> >>>
>>>>>>>> >>>  On 13 Aug 2022, at 02:52, Jeambrun Pierre <
>>>>>>>> pierrejbrun@gmail.com> wrote:
>>>>>>>> >>>
>>>>>>>> >>>  I feel like this is a great alternative at the price of a very
>>>>>>>> moderate effort. (I'd be glad to help with it).
>>>>>>>> >>>
>>>>>>>> >>>  Mutually exclusive sounds good to me as well.
>>>>>>>> >>>
>>>>>>>> >>>  Best,
>>>>>>>> >>>  Pierre
>>>>>>>> >>>
>>>>>>>> >>>  On Fri, Aug 12, 2022 at 15:23, Jarek Potiuk <ja...@potiuk.com>
>>>>>>>> wrote:
>>>>>>>> >>>>
>>>>>>>> >>>>
>>>>>>>> >>>>  Mutually exclusive. I think that has the nice property of
>>>>>>>> forcing people to prepare immutable venvs upfront.
>>>>>>>> >>>>
>>>>>>>> >>>>  On Fri, Aug 12, 2022 at 3:15 PM Ash Berlin-Taylor <
>>>>>>>> ash@apache.org> wrote:
>>>>>>>> >>>>>
>>>>>>>> >>>>>
>>>>>>>> >>>>>  Yes, this has been on my background idea list for an age --
>>>>>>>> I'd love to see it happen!
>>>>>>>> >>>>>
>>>>>>>> >>>>>  Have you thought about how it would behave when you specify
>>>>>>>> an existing virtualenv and include requirements in the operator that are
>>>>>>>> not already installed there? Or would they be mutually exclusive? (I don't
>>>>>>>> mind either way, just wondering which way you are heading)
>>>>>>>> >>>>>
>>>>>>>> >>>>>  -ash
>>>>>>>> >>>>>
>>>>>>>> >>>>>  On Fri, Aug 12 2022 at 14:58:44 +02:00:00, Jarek Potiuk <
>>>>>>>> jarek@potiuk.com> wrote:
>>>>>>>> >>>>>
>>>>>>>> >>>>>  Hello everyone,
>>>>>>>> >>>>>
>>>>>>>> >>>>>  TL;DR; I propose to extend our PythonVirtualenvOperator with
>>>>>>>> "use existing venv" feature and make it a viable way of handling some
>>>>>>>> multi-dependency sets using multiple pre-installed venvs.
>>>>>>>> >>>>>
>>>>>>>> >>>>>  More context:
>>>>>>>> >>>>>
>>>>>>>> >>>>>  I had this idea coming after a discussion in our Slack:
>>>>>>>> https://apache-airflow.slack.com/archives/CCV3FV9KL/p1660233834355179
>>>>>>>> >>>>>
>>>>>>>> >>>>>  My thoughts were - why don't we add support for "use
>>>>>>>> existing venv" in PythonVirtualenvOperator as first-class-citizen ?
>>>>>>>> >>>>>
>>>>>>>> >>>>>  Currently (unless there are some tricks I am not aware of, or
>>>>>>>> you extend PVO), the PVO will always attempt to create a virtualenv based on
>>>>>>>> extra requirements. And while it gives the users a possibility of having
>>>>>>>> some tasks use different dependencies, the drawback is that the venv is
>>>>>>>> created dynamically when the task starts - potentially a lot of overhead for
>>>>>>>> startup time and some unpleasant failure scenarios - like networking
>>>>>>>> problems, PyPI or a local repo not being available, or automated (and
>>>>>>>> unnoticed) upgrades of dependencies.
>>>>>>>> >>>>>
>>>>>>>> >>>>>  Those are basically the same problems that caused us to
>>>>>>>> strongly discourage our users in our Helm Chart to use
>>>>>>>> _PIP_ADDITIONAL_DEPENDENCIES in production and criticize the  Community
>>>>>>>> Helm Chart for dynamic dependency installation they promote as a "valid"
>>>>>>>> approach. Yet our PVO currently does exactly this.
>>>>>>>> >>>>>
>>>>>>>> >>>>>  We had some past discussions on how this can be improved - with
>>>>>>>> caching, or using different images for different dependencies and similar -
>>>>>>>> and we even have the
>>>>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-46+Runtime+isolation+for+airflow+tasks+and+dag+parsing
>>>>>>>> proposal to use different images for different sets of requirements.
>>>>>>>> >>>>>
>>>>>>>> >>>>>  Proposal:
>>>>>>>> >>>>>
>>>>>>>> >>>>>  During the discussion yesterday I started to think a simpler
>>>>>>>> solution is possible and rather simple to implement by us and for users to
>>>>>>>> use.
>>>>>>>> >>>>>
>>>>>>>> >>>>>  Why not have different venvs preinstalled and let the PVO
>>>>>>>> choose the one that should be used?
>>>>>>>> >>>>>
>>>>>>>> >>>>>  It does not invalidate AIP-46. AIP-46 serves a slightly different
>>>>>>>> purpose, and some cases cannot be handled this way (when you need different
>>>>>>>> "system level" dependencies, for example), but it might be much simpler from
>>>>>>>> a deployment point of view and allows handling "multi-dependency sets" for
>>>>>>>> Python libraries only with minimal deployment overhead (overhead which AIP-46
>>>>>>>> necessarily has). And I think it will be enough for a vast number of the
>>>>>>>> "multi-dependency-sets" cases.
>>>>>>>> >>>>>
>>>>>>>> >>>>>  Why don't we allow the users to prepare those venvs upfront
>>>>>>>> and simply enable PVO to use them rather than create them dynamically?
>>>>>>>> >>>>>
>>>>>>>> >>>>>  Advantages:
>>>>>>>> >>>>>
>>>>>>>> >>>>>  * it nicely handles cases where some of your tasks need a
>>>>>>>> different set of dependencies than others (for execution, not necessarily
>>>>>>>> parsing at least initially).
>>>>>>>> >>>>>
>>>>>>>> >>>>>  * no startup time overhead needed as with current PVO
>>>>>>>> >>>>>
>>>>>>>> >>>>>  * possible to run in both cases - "venv installation" and
>>>>>>>> "docker image" installation
>>>>>>>> >>>>>
>>>>>>>> >>>>>  * it has finer granularity level than AIP-46 - unlike in
>>>>>>>> AIP-46 you could use different sets of dependencies
>>>>>>>> >>>>>
>>>>>>>> >>>>>  * very easy to pull off for the users without modifying
>>>>>>>> their deployments. For a local venv, you just create the venvs. For the Docker
>>>>>>>> image case, your custom image needs to add several lines similar to:
>>>>>>>> >>>>>
>>>>>>>> >>>>>  RUN python -m venv --system-site-packages /opt/venv1 &&
>>>>>>>> /opt/venv1/bin/pip install PACKAGE1==NN PACKAGE2==NN
>>>>>>>> >>>>>  RUN python -m venv --system-site-packages /opt/venv2 &&
>>>>>>>> /opt/venv2/bin/pip install PACKAGE1==NN PACKAGE2==NN
>>>>>>>> >>>>>
>>>>>>>> >>>>>  and PythonVenvOperator should have an extra
>>>>>>>> "use_existing_venv=/opt/venv2" parameter
>>>>>>>> >>>>>
>>>>>>>> >>>>>  * we only need to manage ONE image (!) even if you have
>>>>>>>> multiple sets of dependencies. This has the advantage that it is actually
>>>>>>>> LOWER overhead than having separate images for each env when it comes to
>>>>>>>> various resources (the same workers could handle multiple dependency
>>>>>>>> sets, for example, and the same image is reused by multiple Pods in K8S, etc.).
>>>>>>>> >>>>>
>>>>>>>> >>>>>  * later, when AIP-43 (separate dag processor with ability to
>>>>>>>> use different processors for different subdirectories) is completed and
>>>>>>>> AIP-46 is approved/implemented, we could also extend DAG Parsing to be able
>>>>>>>> to use those predefined venvs for parsing. That would eliminate the need
>>>>>>>> for local imports and add support to even use different sets of libraries
>>>>>>>> in top-level code (per DAG, not per task). It would not solve different
>>>>>>>> "system" level dependencies - and for that AIP-46 is still a very valid
>>>>>>>> case.
>>>>>>>> >>>>>
>>>>>>>> >>>>>  Disadvantages:
>>>>>>>> >>>>>
>>>>>>>> >>>>>  I thought very hard about this one and I actually could not
>>>>>>>> find any disadvantages :)
>>>>>>>> >>>>>
>>>>>>>> >>>>>  It's simple to implement, use and explain, it can be
>>>>>>>> implemented very quickly (like - in a few hours with tests and
>>>>>>>> documentation I think) and performance-wise it is better than any other
>>>>>>>> solution (including AIP-46) provided that the case is limited to different
>>>>>>>> Python dependencies.
>>>>>>>> >>>>>
>>>>>>>> >>>>>  But possibly there are things that I missed. It all looks
>>>>>>>> too good to be true, and I wonder why we do not have it already today -
>>>>>>>> once I thought about it, it seems very obvious. So I probably missed
>>>>>>>> something.
>>>>>>>> >>>>>
>>>>>>>> >>>>>  WDYT?
>>>>>>>> >>>>>
>>>>>>>> >>>>>  J.
>>>>>>>> >>>>>
>>>>>>>> >>>>>
>>>>>>>> >>>>>
>>>>>>>> >>>>>
>>>>>>>> >>>>>
>>>>>>>> >>>>>
>>>>>>>> >>>>>
>>>>>>>> >>>
>>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Collin McNulty
>>>>>>> Lead Airflow Engineer
>>>>>>>
>>>>>>> Email: collin@astronomer.io <jo...@astronomer.io>
>>>>>>> Time zone: US Central (CST UTC-6 / CDT UTC-5)
>>>>>>>
>>>>>>>
>>>>>>> <https://www.astronomer.io/>
>>>>>>>
>>>>>>

Re: [DISCUSS] "Use existing venv" support for PythonVirtualenvOperator as counterpart to AIP-46

Posted by Jarek Potiuk <ja...@potiuk.com>.
Right, there it goes: https://github.com/apache/airflow/pull/25780

ExternalPythonOperator. I also took the opportunity to review and update
(and make consistent) the documentation about "Managing conflicting/complex
dependencies" - the subject often raised by our users.
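
As a rough illustration of the idea behind the operator (running a callable
with an interpreter chosen by path), here is a minimal standard-library
sketch. It is not the implementation from the PR: run_in_external_python and
ADD_SOURCE are hypothetical names, the function is passed as source text, and
arguments and results are assumed to be JSON-serializable. sys.executable
stands in for a pre-built venv interpreter such as /opt/venv2/bin/python.

```python
import json
import subprocess
import sys


def run_in_external_python(python: str, source: str, func_name: str, *args):
    """Run func_name (defined in source) under the interpreter at `python`.

    Arguments and the return value cross the process boundary as JSON,
    so they must be JSON-serializable.
    """
    # Small driver script executed by the external interpreter via -c.
    driver = (
        "import json, sys\n"
        f"{source}\n"
        f"result = {func_name}(*json.loads(sys.argv[1]))\n"
        "print(json.dumps(result))\n"
    )
    completed = subprocess.run(
        [python, "-c", driver, json.dumps(args)],
        check=True,  # raise CalledProcessError if the external run fails
        capture_output=True,
        text=True,
    )
    # The driver prints the JSON result as its last line of output.
    return json.loads(completed.stdout.strip().splitlines()[-1])


ADD_SOURCE = "def add(a, b):\n    return a + b\n"

if __name__ == "__main__":
    # Replace sys.executable with the path to a prepared venv's interpreter.
    print(run_in_external_python(sys.executable, ADD_SOURCE, "add", 2, 3))
```

Because the interpreter is started directly from its venv path, its
site-packages are used without any explicit "activate" step, and no
dependency installation happens at task start.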

There were a few inconsistencies (and we did not have embedded examples
from the recently merged Kubernetes decorator).

With this change we have:

* Best Practices entry explaining the "motivation", "strategies" and
consequences of using different approaches (PythonVirtualenv,
ExternalPython, Docker, Kubernetes). This should help our users to make
better decisions when to use which strategy.
* The Task Flow Tutorial chapter about "complex/conflicting dependencies"
is now better/more consistently organized to show users how they can
use all 4 of those decorators to do that.

Those docs are now cross-linked - you can easily go from the
"best-practices" to examples and back. Looking forward to reviews -
especially the docs part (as usual) is better if multiple pairs of eyes are
looking at it.

J.




On Thu, Sep 1, 2022 at 10:51 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> Fine :). Let it be then :)
>
> On Thu, Sep 1, 2022 at 6:40 AM Abhishek Bhakat
> <ab...@astronomer.io.invalid> wrote:
>
>> Would like to vote for ExternalPythonOperator.
>> Cause usually Virtualenv have symbolic links for python binaries untill
>> used —copies to make it fully portable.
>> Additionally there is option to use differently compiled python
>> altogether (For example pypy <https://www.pypy.org/index.html> or jython
>> <https://www.jython.org/>). Naming these "External Pythons" makes more
>> sense to me.
>>
>> Thanks,
>> Abhishek
>>
>> On 31-Aug-2022 at 9:30:42 PM, Ash Berlin-Taylor <as...@apache.org> wrote:
>>
>>> Personally if those two I greatly prefer ExternalPythonOperator. (I
>>> didn't vote for either of those)
>>>
>>> (Also I think PythonExternalEnvOperator would be the "correct" casing,
>>> Virtualenv is a thing in python, Externalenv isn't.)
>>>
>>> -ash
>>>
>>> On 31 August 2022 21:28:20 BST, Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>
>>>> We've got 56 votes (wow!)
>>>>
>>>> ExternalPythonOperator won. It got 41% . Followed by
>>>> PythonExternalenvOperator 30% and PythonRunenvOperator with 26%.
>>>>
>>>> I am fine with either of those. But - despite slightly lower support -
>>>> I think PythonExternalenvOperator reflects a bit better the resemblance to
>>>> PythonVirtualenvOperator that I think is important.
>>>>
>>>> Asking those who were very strong on ExternalPythonOperator - is
>>>> PythonExternalenvOperator "good enough" for you as well?
>>>>
>>>> The poll had only one option to choose from, but if that is an
>>>> acceptable option for those who favoured "ExternalPythonOperator" - I have
>>>> personally a slight preference for that one.
>>>>
>>>> J.
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Aug 31, 2022 at 3:10 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>
>>>>> Just 5 hours left to change the world!
>>>>>
>>>>> You can become one of the people who influenced the decision on naming
>>>>> the new operator :D
>>>>>
>>>>> https://twitter.com/jarekpotiuk/status/1563602012100767746
>>>>>
>>>>> (Right, maybe changing the world just a little, but still)
>>>>>
>>>>> J.
>>>>>
>>>>>
>>>>> On Sat, Aug 27, 2022 at 9:01 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>>
>>>>>> Seems we are only now at the stage that we need to choose the best
>>>>>> name for the operator
>>>>>>
>>>>>> I started a name poll on Twitter :)
>>>>>>
>>>>>> https://twitter.com/jarekpotiuk/status/1563602012100767746
>>>>>>
>>>>>> PR here: https://github.com/apache/airflow/pull/25780
>>>>>>
>>>>>> J.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 18, 2022 at 1:53 AM Jarek Potiuk <ja...@potiuk.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Draft PR - needs some more tests and review with typing changes - in
>>>>>>> https://github.com/apache/airflow/pull/25780
>>>>>>> Eventually PythonExternalOperator seems like a good name.
>>>>>>>
>>>>>>> J.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 17, 2022 at 10:37 PM Jeambrun Pierre <
>>>>>>> pierrejbrun@gmail.com> wrote:
>>>>>>>
>>>>>>>> I also like the ability to use a specific interpreter.
>>>>>>>>
>>>>>>>> Maybe we could leave everything that is env related to the PVO
>>>>>>>> (even using an existing one) and let another one handle the interpreter.
>>>>>>>>
>>>>>>>> As Ash mentioned I also feel like an additional parameter
>>>>>>>> (python/interpreter etc.) to the PO would make sense and is quite intuitive
>>>>>>>> rather than a complete new operator, but it might be harder to implement.
>>>>>>>>
>>>>>>>> Best
>>>>>>>> Pierre Jeambrun
>>>>>>>>
>>>>>>>> Le mer. 17 août 2022 à 20:46, Collin McNulty
>>>>>>>> <co...@astronomer.io.invalid> a écrit :
>>>>>>>>
>>>>>>>>> I concur that this would be very useful. I can see a common
>>>>>>>>> pattern being to have a task to create an environment if it does not
>>>>>>>>> already exist and then subsequent tasks use that environment.
>>>>>>>>>
>>>>>>>>> On Wed, Aug 17, 2022 at 12:30 PM Jarek Potiuk <ja...@potiuk.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Sounds like this is really in the middle between PVO and PO :).
>>>>>>>>>>
>>>>>>>>>> BTW. I spoke with a customer of mine today and they said they
>>>>>>>>>> would
>>>>>>>>>> ABSOLUTELY love it. They were actually blocked from migrating to
>>>>>>>>>> 2.3.3
>>>>>>>>>> because one of their teams needed a DBT environment while the
>>>>>>>>>> other
>>>>>>>>>> team needed some other dependency and they are conflicting with
>>>>>>>>>> each
>>>>>>>>>> other. They are using Nomad + Docker already and while extending
>>>>>>>>>> the
>>>>>>>>>> image with another venv is super-easy for them, they were
>>>>>>>>>> considering
>>>>>>>>>> building several Docker images to serve their users but it is an
>>>>>>>>>> order
>>>>>>>>>> of magnitude more complex problem for them because they would
>>>>>>>>>> have to
>>>>>>>>>> make a whole new pipeline to build a distribute multiple images
>>>>>>>>>> and
>>>>>>>>>> implements queue-base split between the teams or switch to using
>>>>>>>>>> DockerOperator.
>>>>>>>>>>
>>>>>>>>>> This one will allow them to do limited version of multi-tenancy
>>>>>>>>>> for
>>>>>>>>>> their teams - without the actual separation but with even more
>>>>>>>>>> fine-grained separation of envs - because they would be able to
>>>>>>>>>> use
>>>>>>>>>> different deps even for different tasks in the same DAG.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> J,
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 17, 2022 at 6:21 PM Ash Berlin-Taylor <as...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>> >
>>>>>>>>>> > Another option would be to change the PythonOperator/@task to
>>>>>>>>>> take a `python` argument (which also does change the behaviour of _that_
>>>>>>>>>> operator a lot with or without that argument if we did that.)
>>>>>>>>>> >
>>>>>>>>>> > On 17 August 2022 15:46:52 BST, Jarek Potiuk <ja...@potiuk.com>
>>>>>>>>>> wrote:
>>>>>>>>>> >>
>>>>>>>>>> >> Yeah. TP - I like that explicit separation. It's much cleaner.
>>>>>>>>>> I still
>>>>>>>>>> >> have to think about the name though. While I see where
>>>>>>>>>> >> ExternalPythonOperator comes from,  It sounds a bit less than
>>>>>>>>>> obvious.
>>>>>>>>>> >> I think the name should somehow contain "Environment" because
>>>>>>>>>> very few
>>>>>>>>>> >> people realise that running Python from a virtualenv actually
>>>>>>>>>> >> implicitly "activates" the venv.
>>>>>>>>>> >> I think maybe deprecating the old PythonVirtualenvOperator and
>>>>>>>>>> >> introducing two new operators:
>>>>>>>>>> PythonInCreatedVirtualEnvOperator,
>>>>>>>>>> >> PythonInExistingVirtualEnvOperator ? Not exactly those names -
>>>>>>>>>> they
>>>>>>>>>> >> are too long - but something like that. Maybe we should get
>>>>>>>>>> rid of
>>>>>>>>>> >> Python in the name at all ?
>>>>>>>>>> >>
>>>>>>>>>> >> BTW. I think we should generally do more of the discussions
>>>>>>>>>> here and
>>>>>>>>>> >> express our thoughts about Airflow here. Even if there are no
>>>>>>>>>> answers
>>>>>>>>>> >> or interest immediately, I think that it makes sense to do a
>>>>>>>>>> bit of a
>>>>>>>>>> >> melting pot that sometimes might produce some cool (or rather
>>>>>>>>>> hot)
>>>>>>>>>> >> stuff as a result.
>>>>>>>>>> >>
>>>>>>>>>> >> On Wed, Aug 17, 2022 at 8:45 AM Tzu-ping Chung
>>>>>>>>>> <tp...@astronomer.io.invalid> wrote:
>>>>>>>>>> >>>
>>>>>>>>>> >>>
>>>>>>>>>> >>>  One thing I thought of (but never bothered to write about)
>>>>>>>>>> is to introduce a separate operator instead, say ExternalPythonOperator
>>>>>>>>>> (bike shedding on name is welcomed), that explicitly takes a path to the
>>>>>>>>>> interpreter (say in a virtual environment) and just use that to run the
>>>>>>>>>> code. This also enables users to create a virtual environment upfront, but
>>>>>>>>>> avoids needing to overload PythonVirtualenvOperator for the purpose. This
>>>>>>>>>> also opens an extra use case that you can use any Python installation to
>>>>>>>>>> run the code (say a custom-compiled interpreter), although nobody asked
>>>>>>>>>> about that.
>>>>>>>>>> >>>
>>>>>>>>>> >>>  TP
>>>>>>>>>> >>>
>>>>>>>>>> >>>
>>>>>>>>>> >>>  On 13 Aug 2022, at 02:52, Jeambrun Pierre <
>>>>>>>>>> pierrejbrun@gmail.com> wrote:
>>>>>>>>>> >>>
>>>>>>>>>> >>>  I feel like this is a great alternative at the price of a
>>>>>>>>>> very moderate effort. (I'd be glad to help with it).
>>>>>>>>>> >>>
>>>>>>>>>> >>>  Mutually exclusive sounds good to me as well.
>>>>>>>>>> >>>
>>>>>>>>>> >>>  Best,
>>>>>>>>>> >>>  Pierre
>>>>>>>>>> >>>
>>>>>>>>>> >>>  Le ven. 12 août 2022 à 15:23, Jarek Potiuk <ja...@potiuk.com>
>>>>>>>>>> a écrit :
>>>>>>>>>> >>>>
>>>>>>>>>> >>>>
>>>>>>>>>> >>>>  Mutually exclusive. I think that has the nice property of
>>>>>>>>>> forcing people to prepare immutable venvs upfront.
>>>>>>>>>> >>>>
>>>>>>>>>> >>>>  On Fri, Aug 12, 2022 at 3:15 PM Ash Berlin-Taylor <
>>>>>>>>>> ash@apache.org> wrote:
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  Yes, this has been on my background idea list for an age
>>>>>>>>>> -- I'd love to see it happen!
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  Have you thought about how it would behave when you
>>>>>>>>>> specify an existing virtualenv and include requirements in the operator
>>>>>>>>>> that are not already installed there? Or would they be mutually exclusive?
>>>>>>>>>> (I don't mind either way, just wondering which way you are heading)
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  -ash
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  On Fri, Aug 12 2022 at 14:58:44 +02:00:00, Jarek Potiuk <
>>>>>>>>>> jarek@potiuk.com> wrote:
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  Hello everyone,
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  TL;DR; I propose to extend our PythonVirtualenvOperator
>>>>>>>>>> with "use existing venv" feature and make it a viable way of handling some
>>>>>>>>>> multi-dependency sets using multiple pre-installed venvs.
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  More context:
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  I had this idea coming after a discussion in our Slack:
>>>>>>>>>> https://apache-airflow.slack.com/archives/CCV3FV9KL/p1660233834355179
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  My thoughts were - why don't we add support for "use
>>>>>>>>>> existing venv" in PythonVirtualenvOperator as first-class-citizen ?
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  Currently (unless there are some tricks I am not aware of)
>>>>>>>>>> or extend PVO, the PVO will always attempt to create a virtualenv based on
>>>>>>>>>> extra requirements. And while it gives the users a possibility of having
>>>>>>>>>> some tasks use different dependencies, the drawback is that the venv is
>>>>>>>>>> created dynamically when tasks starts - potentially a lot of overhead for
>>>>>>>>>> startup time and some unpleasant failure scenarios - like networking
>>>>>>>>>> problems, PyPI or local repoi not available, automated (and unnoticed)
>>>>>>>>>> upgrade of dependencies.
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  Those are basically the same problems that caused us to
>>>>>>>>>> strongly discourage our users in our Helm Chart to use
>>>>>>>>>> _PIP_ADDITIONAL_DEPENDENCIES in production and criticize the  Community
>>>>>>>>>> Helm Chart for dynamic dependency installation they promote as a "valid"
>>>>>>>>>> approach. Yet our PVO currently does exactly this.
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  We had some past discussions how this can be improved -
>>>>>>>>>> with caching, or using different images for different dependencies and
>>>>>>>>>> similar - and even we have
>>>>>>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-46+Runtime+isolation+for+airflow+tasks+and+dag+parsing
>>>>>>>>>> proposal to use different images for different sets of requirements.
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  Proposal:
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  During the discussion yesterday I started to think a
>>>>>>>>>> simpler solution is possible and rather simple to implement by us and for
>>>>>>>>>> users to use.
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  Why not have different venvs preinstalled and let the PVO
>>>>>>>>>> choose the one that should be used?
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  It does not invalidate AIP-46. AIP-46 serves a bit
>>>>>>>>>> different purpose and some cases cannot be handled this way - when you need
>>>>>>>>>> different "system level" dependencies for example) but it might be much
>>>>>>>>>> simpler from deployment point of view and allow it to handle
>>>>>>>>>> "multi-dependency sets" for Python libraries only with minimal deployment
>>>>>>>>>> overhead (which AIP-46 necessarily has). And I think it will be enough for
>>>>>>>>>> a vast number of the "multi-dependency-sets" cases.
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  Why don't we allow the users to prepare those venvs
>>>>>>>>>> upfront and simply enable PVE to use them rather than create them
>>>>>>>>>> dynamically ?
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  Advantages:
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  * it nicely handles cases where some of your tasks need a
>>>>>>>>>> different set of dependencies than others (for execution, not necessarily
>>>>>>>>>> parsing at least initially).
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  * no startup time overhead needed as with current PVO
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  * possible to run in both cases - "venv installation" and
>>>>>>>>>> "docker image" installation
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  * it has finer granularity level than AIP-46 - unlike in
>>>>>>>>>> AIP-46 you could use different sets of dependencies
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  * very easy to pull off for the users without modifying
>>>>>>>>>> their deployments,For local venv, you just create the venvs, For Docker
>>>>>>>>>> image case, your custom image needs to add several lines similar to:
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  RUN python -m venv --system-site-packages PACKAGE1==NN
>>>>>>>>>> PACKAGE2==NN /opt/venv1
>>>>>>>>>> >>>>>  RUN python -m venv --system-site-packages PACKAGE1==NN
>>>>>>>>>> PACKAGE2==NN /opt/venv2
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  and PythonVenvOperator should have extra
>>>>>>>>>> "use_existing_venv=/opt/venv2") parameter
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  * we only need to manage ONE image (!) even if you have
>>>>>>>>>> multiple sets of dependencies (this has the advantage that it is actually
>>>>>>>>>> LOWER overhead than having separate images for each env -when it comes to
>>>>>>>>>> various resources overhead (same workers could handle multiple dependency
>>>>>>>>>> sets for examples, same image is reused by multiple PODs in K8S etc. ).
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  * later (when AIP-43 (separate dag processor with ability
>>>>>>>>>> to use different processors for different subdirectories) is completed and
>>>>>>>>>> AIP-46 is approved/implemented, we could also extend DAG Parsing to be able
>>>>>>>>>> to use those predefined venvs for parsing. That would eliminate the need
>>>>>>>>>> for local imports and add support to even use different sets of libraries
>>>>>>>>>> in top-level code (per DAG, not per task). It would not solve different
>>>>>>>>>> "system" level dependencies - and for that AiP-46 is still a very valid
>>>>>>>>>> case.
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  Disadvantages:
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  I thought very hard about this one and I actually could
>>>>>>>>>> not find any disadvantages :)
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  It's simple to implement, use and explain, it can be
>>>>>>>>>> implemented very quickly (like - in a few hours with tests and
>>>>>>>>>> documentation I think) and performance-wise it is better than any other
>>>>>>>>>> solution (including AIP-46) provided that the case is limited to different
>>>>>>>>>> Python dependencies.
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  But possibly there are things that I missed. It all looks
>>>>>>>>>> too good to be true, and I wonder why we do not have it already today -
>>>>>>>>>> once I thought about it, it seems very obvious. So I probably missed
>>>>>>>>>> something.
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  WDYT?
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>  J.
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Collin McNulty
>>>>>>>>> Lead Airflow Engineer
>>>>>>>>>
>>>>>>>>> Email: collin@astronomer.io <jo...@astronomer.io>
>>>>>>>>> Time zone: US Central (CST UTC-6 / CDT UTC-5)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> <https://www.astronomer.io/>
>>>>>>>>>
>>>>>>>>
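
Editor's sketch: the Dockerfile lines quoted above boil down to "create a venv upfront, then point the operator at it". A minimal illustration of the first half, using only the standard-library venv module, might look like the following. The helper name create_prebuilt_venv and the temporary path are hypothetical and not part of the proposal; installing PACKAGE1/PACKAGE2 with the venv's own pip would be a separate step that is left out here to keep the sketch offline.

```python
import os
import tempfile
import venv
from pathlib import Path


def create_prebuilt_venv(path: Path) -> Path:
    """Create a venv at `path` and return the path to its interpreter.

    Hypothetical helper for illustration only.
    """
    # system_site_packages=True corresponds to the --system-site-packages flag;
    # with_pip=False keeps the sketch fast and avoids any network access.
    venv.create(path, system_site_packages=True, with_pip=False)
    bin_dir = "Scripts" if os.name == "nt" else "bin"
    exe = "python.exe" if os.name == "nt" else "python"
    return path / bin_dir / exe


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        python_bin = create_prebuilt_venv(Path(tmp) / "venv1")
        print(python_bin, python_bin.exists())
```

In an image build the same thing would happen once at build time, so tasks never pay the venv creation cost at startup.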

Re: [DISCUSS] "Use existing venv" support for PythonVirtualenvOperator as counterpart to AIP-46

Posted by Jarek Potiuk <ja...@potiuk.com>.
Fine :). Let it be then :)

On Thu, Sep 1, 2022 at 6:40 AM Abhishek Bhakat
<ab...@astronomer.io.invalid> wrote:

> Would like to vote for ExternalPythonOperator.
> Because a virtualenv usually has symbolic links to the Python binaries
> unless --copies is used to make it fully portable.
> Additionally, there is an option to use a differently compiled Python
> altogether (for example pypy <https://www.pypy.org/index.html> or jython
> <https://www.jython.org/>). Naming these "External Pythons" makes more
> sense to me.
>
> Thanks,
> Abhishek
>
> On 31-Aug-2022 at 9:30:42 PM, Ash Berlin-Taylor <as...@apache.org> wrote:
>
>> Personally, of those two I greatly prefer ExternalPythonOperator. (I
>> didn't vote for either of those)
>>
>> (Also I think PythonExternalEnvOperator would be the "correct" casing,
>> Virtualenv is a thing in python, Externalenv isn't.)
>>
>> -ash
>>
>> On 31 August 2022 21:28:20 BST, Jarek Potiuk <ja...@potiuk.com> wrote:
>>>
>>> We've got 56 votes (wow!)
>>>
>>> ExternalPythonOperator won. It got 41%, followed by
>>> PythonExternalenvOperator with 30% and PythonRunenvOperator with 26%.
>>>
>>> I am fine with either of those. But - despite slightly lower support - I
>>> think PythonExternalenvOperator reflects a bit better the resemblance to
>>> PythonVirtualenvOperator that I think is important.
>>>
>>> Asking those who were very strong on ExternalPythonOperator - is
>>> PythonExternalenvOperator "good enough" for you as well?
>>>
>>> The poll had only one option to choose from, but if that is an
>>> acceptable option for those who favoured "ExternalPythonOperator" - I have
>>> personally a slight preference for that one.
>>>
>>> J.
>>>
>>>
>>>
>>>
>>> On Wed, Aug 31, 2022 at 3:10 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>
>>>> Just 5 hours left to change the world!
>>>>
>>>> You can become one of the people who influenced the decision on naming
>>>> the new operator :D
>>>>
>>>> https://twitter.com/jarekpotiuk/status/1563602012100767746
>>>>
>>>> (Right, maybe changing the world just a little, but still)
>>>>
>>>> J.
>>>>
>>>>
>>>> On Sat, Aug 27, 2022 at 9:01 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>
>>>>> Seems we are only now at the stage that we need to choose the best
>>>>> name for the operator
>>>>>
>>>>> I started a name poll on Twitter :)
>>>>>
>>>>> https://twitter.com/jarekpotiuk/status/1563602012100767746
>>>>>
>>>>> PR here: https://github.com/apache/airflow/pull/25780
>>>>>
>>>>> J.
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Aug 18, 2022 at 1:53 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>>
>>>>>> Draft PR - needs some more tests and review with typing changes - in
>>>>>> https://github.com/apache/airflow/pull/25780
>>>>>> Eventually PythonExternalOperator seems like a good name.
>>>>>>
>>>>>> J.
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 17, 2022 at 10:37 PM Jeambrun Pierre <
>>>>>> pierrejbrun@gmail.com> wrote:
>>>>>>
>>>>>>> I also like the ability to use a specific interpreter.
>>>>>>>
>>>>>>> Maybe we could leave everything that is env related to the PVO (even
>>>>>>> using an existing one) and let another one handle the interpreter.
>>>>>>>
>>>>>>> As Ash mentioned I also feel like an additional parameter
>>>>>>> (python/interpreter etc.) to the PO would make sense and is quite intuitive
>>>>>>> rather than a complete new operator, but it might be harder to implement.
>>>>>>>
>>>>>>> Best
>>>>>>> Pierre Jeambrun
>>>>>>>
>>>>>>> Le mer. 17 août 2022 à 20:46, Collin McNulty
>>>>>>> <co...@astronomer.io.invalid> a écrit :
>>>>>>>
>>>>>>>> I concur that this would be very useful. I can see a common pattern
>>>>>>>> being to have a task to create an environment if it does not already exist
>>>>>>>> and then subsequent tasks use that environment.
>>>>>>>>
>>>>>>>> On Wed, Aug 17, 2022 at 12:30 PM Jarek Potiuk <ja...@potiuk.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Sounds like this is really in the middle between PVO and PO :).
>>>>>>>>>
>>>>>>>>> BTW. I spoke with a customer of mine today and they said they would
>>>>>>>>> ABSOLUTELY love it. They were actually blocked from migrating to 2.3.3
>>>>>>>>> because one of their teams needed a DBT environment while the other
>>>>>>>>> team needed some other dependency and they are conflicting with each
>>>>>>>>> other. They are using Nomad + Docker already and while extending the
>>>>>>>>> image with another venv is super-easy for them, they were considering
>>>>>>>>> building several Docker images to serve their users but it is an order
>>>>>>>>> of magnitude more complex problem for them because they would have to
>>>>>>>>> make a whole new pipeline to build and distribute multiple images and
>>>>>>>>> implement a queue-based split between the teams or switch to using
>>>>>>>>> DockerOperator.
>>>>>>>>>
>>>>>>>>> This one will allow them to do a limited version of multi-tenancy for
>>>>>>>>> their teams - without the actual separation but with even more
>>>>>>>>> fine-grained separation of envs - because they would be able to use
>>>>>>>>> different deps even for different tasks in the same DAG.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> J,
>>>>>>>>>
>>>>>>>>> On Wed, Aug 17, 2022 at 6:21 PM Ash Berlin-Taylor <as...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>> >
>>>>>>>>> > Another option would be to change the PythonOperator/@task to
>>>>>>>>> take a `python` argument (which also does change the behaviour of _that_
>>>>>>>>> operator a lot with or without that argument if we did that.)
>>>>>>>>> >
>>>>>>>>> > On 17 August 2022 15:46:52 BST, Jarek Potiuk <ja...@potiuk.com>
>>>>>>>>> wrote:
>>>>>>>>> >>
>>>>>>>>> >> Yeah. TP - I like that explicit separation. It's much cleaner. I still
>>>>>>>>> >> have to think about the name though. While I see where
>>>>>>>>> >> ExternalPythonOperator comes from, it sounds a bit less than obvious.
>>>>>>>>> >> I think the name should somehow contain "Environment" because very few
>>>>>>>>> >> people realise that running Python from a virtualenv actually
>>>>>>>>> >> implicitly "activates" the venv.
>>>>>>>>> >> I think maybe deprecating the old PythonVirtualenvOperator and
>>>>>>>>> >> introducing two new operators: PythonInCreatedVirtualEnvOperator,
>>>>>>>>> >> PythonInExistingVirtualEnvOperator ? Not exactly those names - they
>>>>>>>>> >> are too long - but something like that. Maybe we should get rid of
>>>>>>>>> >> Python in the name at all ?
>>>>>>>>> >>
>>>>>>>>> >> BTW. I think we should generally do more of the discussions here and
>>>>>>>>> >> express our thoughts about Airflow here. Even if there are no answers
>>>>>>>>> >> or interest immediately, I think that it makes sense to do a bit of a
>>>>>>>>> >> melting pot that sometimes might produce some cool (or rather hot)
>>>>>>>>> >> stuff as a result.
>>>>>>>>> >>
>>>>>>>>> >> On Wed, Aug 17, 2022 at 8:45 AM Tzu-ping Chung
>>>>>>>>> <tp...@astronomer.io.invalid> wrote:
>>>>>>>>> >>>
>>>>>>>>> >>>
>>>>>>>>> >>>  One thing I thought of (but never bothered to write about) is
>>>>>>>>> to introduce a separate operator instead, say ExternalPythonOperator
>>>>>>>>> (bikeshedding on the name is welcome), that explicitly takes a path to the
>>>>>>>>> interpreter (say in a virtual environment) and just uses that to run the
>>>>>>>>> code. This also enables users to create a virtual environment upfront, but
>>>>>>>>> avoids needing to overload PythonVirtualenvOperator for the purpose. This
>>>>>>>>> also opens an extra use case that you can use any Python installation to
>>>>>>>>> run the code (say a custom-compiled interpreter), although nobody asked
>>>>>>>>> about that.
>>>>>>>>> >>>
>>>>>>>>> >>>  TP
>>>>>>>>> >>>
>>>>>>>>> >>>
>>>>>>>>> >>>  On 13 Aug 2022, at 02:52, Jeambrun Pierre <
>>>>>>>>> pierrejbrun@gmail.com> wrote:
>>>>>>>>> >>>
>>>>>>>>> >>>  I feel like this is a great alternative at the price of a
>>>>>>>>> very moderate effort. (I'd be glad to help with it).
>>>>>>>>> >>>
>>>>>>>>> >>>  Mutually exclusive sounds good to me as well.
>>>>>>>>> >>>
>>>>>>>>> >>>  Best,
>>>>>>>>> >>>  Pierre
>>>>>>>>> >>>
>>>>>>>>> >>>  Le ven. 12 août 2022 à 15:23, Jarek Potiuk <ja...@potiuk.com>
>>>>>>>>> a écrit :
>>>>>>>>> >>>>
>>>>>>>>> >>>>
>>>>>>>>> >>>>  Mutually exclusive. I think that has the nice property of
>>>>>>>>> forcing people to prepare immutable venvs upfront.
>>>>>>>>> >>>>
>>>>>>>>> >>>>  On Fri, Aug 12, 2022 at 3:15 PM Ash Berlin-Taylor <
>>>>>>>>> ash@apache.org> wrote:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  Yes, this has been on my background idea list for an age --
>>>>>>>>> I'd love to see it happen!
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  Have you thought about how it would behave when you specify
>>>>>>>>> an existing virtualenv and include requirements in the operator that are
>>>>>>>>> not already installed there? Or would they be mutually exclusive? (I don't
>>>>>>>>> mind either way, just wondering which way you are heading)
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  -ash
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  On Fri, Aug 12 2022 at 14:58:44 +02:00:00, Jarek Potiuk <
>>>>>>>>> jarek@potiuk.com> wrote:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  Hello everyone,
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  TL;DR; I propose to extend our PythonVirtualenvOperator
>>>>>>>>> with "use existing venv" feature and make it a viable way of handling some
>>>>>>>>> multi-dependency sets using multiple pre-installed venvs.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  More context:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  I had this idea coming after a discussion in our Slack:
>>>>>>>>> https://apache-airflow.slack.com/archives/CCV3FV9KL/p1660233834355179
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  My thoughts were - why don't we add support for "use
>>>>>>>>> existing venv" in PythonVirtualenvOperator as first-class-citizen ?
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  Currently (unless there are some tricks I am not aware of,
>>>>>>>>> or you extend PVO), the PVO will always attempt to create a virtualenv based
>>>>>>>>> on extra requirements. And while it gives the users a possibility of having
>>>>>>>>> some tasks use different dependencies, the drawback is that the venv is
>>>>>>>>> created dynamically when the task starts - potentially a lot of overhead for
>>>>>>>>> startup time and some unpleasant failure scenarios - like networking
>>>>>>>>> problems, PyPI or local repo not available, automated (and unnoticed)
>>>>>>>>> upgrade of dependencies.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  Those are basically the same problems that caused us to
>>>>>>>>> strongly discourage our users in our Helm Chart to use
>>>>>>>>> _PIP_ADDITIONAL_DEPENDENCIES in production and criticize the  Community
>>>>>>>>> Helm Chart for dynamic dependency installation they promote as a "valid"
>>>>>>>>> approach. Yet our PVO currently does exactly this.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  We had some past discussions on how this can be improved -
>>>>>>>>> with caching, or using different images for different dependencies and
>>>>>>>>> similar - and we even have the
>>>>>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-46+Runtime+isolation+for+airflow+tasks+and+dag+parsing
>>>>>>>>> proposal to use different images for different sets of requirements.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  Proposal:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  During the discussion yesterday I started to think a
>>>>>>>>> simpler solution is possible and rather simple to implement by us and for
>>>>>>>>> users to use.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  Why not have different venvs preinstalled and let the PVO
>>>>>>>>> choose the one that should be used?
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  It does not invalidate AIP-46. AIP-46 serves a bit
>>>>>>>>> different purpose and some cases cannot be handled this way (when you need
>>>>>>>>> different "system level" dependencies, for example) but it might be much
>>>>>>>>> simpler from deployment point of view and allow handling
>>>>>>>>> "multi-dependency sets" for Python libraries only with minimal deployment
>>>>>>>>> overhead (which AIP-46 necessarily has). And I think it will be enough for
>>>>>>>>> a vast number of the "multi-dependency-sets" cases.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  Why don't we allow the users to prepare those venvs upfront
>>>>>>>>> and simply enable PVO to use them rather than create them dynamically ?
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  Advantages:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  * it nicely handles cases where some of your tasks need a
>>>>>>>>> different set of dependencies than others (for execution, not necessarily
>>>>>>>>> parsing at least initially).
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  * no startup time overhead needed as with current PVO
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  * possible to run in both cases - "venv installation" and
>>>>>>>>> "docker image" installation
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  * it has finer granularity level than AIP-46 - unlike in
>>>>>>>>> AIP-46 you could use different sets of dependencies
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  * very easy to pull off for the users without modifying
>>>>>>>>> their deployments. For local venv, you just create the venvs. For Docker
>>>>>>>>> image case, your custom image needs to add several lines similar to:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  RUN python -m venv --system-site-packages /opt/venv1 &&
>>>>>>>>> /opt/venv1/bin/pip install PACKAGE1==NN PACKAGE2==NN
>>>>>>>>> >>>>>  RUN python -m venv --system-site-packages /opt/venv2 &&
>>>>>>>>> /opt/venv2/bin/pip install PACKAGE1==NN PACKAGE2==NN
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  and PythonVirtualenvOperator should have an extra
>>>>>>>>> "use_existing_venv=/opt/venv2" parameter
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  * we only need to manage ONE image (!) even if you have
>>>>>>>>> multiple sets of dependencies. This has the advantage that it is actually
>>>>>>>>> LOWER overhead than having separate images for each env when it comes to
>>>>>>>>> various resources (same workers could handle multiple dependency
>>>>>>>>> sets, for example; same image is reused by multiple PODs in K8S, etc.).
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  * later (when AIP-43 (separate dag processor with ability
>>>>>>>>> to use different processors for different subdirectories) is completed and
>>>>>>>>> AIP-46 is approved/implemented), we could also extend DAG Parsing to be able
>>>>>>>>> to use those predefined venvs for parsing. That would eliminate the need
>>>>>>>>> for local imports and add support to even use different sets of libraries
>>>>>>>>> in top-level code (per DAG, not per task). It would not solve different
>>>>>>>>> "system" level dependencies - and for that AIP-46 is still a very valid
>>>>>>>>> case.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  Disadvantages:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  I thought very hard about this one and I actually could not
>>>>>>>>> find any disadvantages :)
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  It's simple to implement, use and explain, it can be
>>>>>>>>> implemented very quickly (like - in a few hours with tests and
>>>>>>>>> documentation I think) and performance-wise it is better than any other
>>>>>>>>> solution (including AIP-46) provided that the case is limited to different
>>>>>>>>> Python dependencies.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  But possibly there are things that I missed. It all looks
>>>>>>>>> too good to be true, and I wonder why we do not have it already today -
>>>>>>>>> once I thought about it, it seems very obvious. So I probably missed
>>>>>>>>> something.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  WDYT?
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>  J.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>
>>>>>>>>> >>>
>>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Collin McNulty
>>>>>>>> Lead Airflow Engineer
>>>>>>>>
>>>>>>>> Email: collin@astronomer.io <jo...@astronomer.io>
>>>>>>>> Time zone: US Central (CST UTC-6 / CDT UTC-5)
>>>>>>>>
>>>>>>>>
>>>>>>>> <https://www.astronomer.io/>
>>>>>>>>
>>>>>>>
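
Editor's sketch: the core idea quoted above (an operator that "explicitly takes a path to the interpreter ... and just uses that to run the code") can be illustrated with the standard library alone. This is not the implementation from the PR; run_in_external_python, TASK_SOURCE and the JSON hand-off are assumptions made for illustration, and sys.executable merely stands in for something like /opt/venv2/bin/python.

```python
import json
import subprocess
import sys


def run_in_external_python(python: str, source: str, func_name: str, *args):
    """Run `func_name(*args)` from `source` with the interpreter at `python`.

    Hypothetical helper: arguments and the result travel as JSON, so they
    must be JSON-serializable and the function must not print to stdout.
    """
    driver = (
        "import json, sys\n"
        f"{source}\n"
        f"result = {func_name}(*json.loads(sys.argv[1]))\n"
        "print(json.dumps(result))\n"
    )
    completed = subprocess.run(
        [python, "-c", driver, json.dumps(args)],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child interpreter fails
    )
    return json.loads(completed.stdout)


# Task code is passed as source text because the external interpreter
# cannot see objects living in the calling process.
TASK_SOURCE = """
def add(a, b):
    return a + b
"""

if __name__ == "__main__":
    print(run_in_external_python(sys.executable, TASK_SOURCE, "add", 2, 3))
```

Because only a path is passed, the same call works for a venv interpreter, a custom-compiled Python, or any other installation, which is the extra use case mentioned in the thread.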