Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/12/29 13:16:22 UTC

[GitHub] [airflow] zacharya19 opened a new issue #13364: Add virtual env creation in Celery workers

zacharya19 opened a new issue #13364:
URL: https://github.com/apache/airflow/issues/13364


   **Description**
   
   Add the option for the Celery worker to create a new virtual env, install some packages, and run the `airflow run` command inside it (based on `executor_config` params).
   A really nice-to-have: reusable virtual envs that can be shared between tasks with the same params (based on user configuration).
   
   **Use case / motivation**
   
   Once you get to the point where you want to create a cluster for different types of Python tasks and you have multiple teams working on the same cluster, you need to start splitting the business logic code into different Python packages to allow better version control and to avoid restarting the workers when deploying new utility code.
   I think it would be amazing if we could allow creating new virtual envs as part of Airflow and control the package versions.
   
   I know that `PythonVirtualenvOperator` exists, but:
   1. Creating environments feels like an executor's job to me; the DAG author should not have to use specific operators for it.
   2. The big downside is that if I want to use `ShortCircuitOperator`, `BranchPythonOperator`, or any new Python-based operator, I have to create a new operator that inherits from `PythonVirtualenvOperator` and duplicates the desired functionality.
   
   **Are you willing to submit a PR?**
   
   Yes, would love to.
   
   **Related Issues**
   
   Not that I can find.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #13364: Add virtual env creation in Celery workers

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #13364:
URL: https://github.com/apache/airflow/issues/13364#issuecomment-753343945


   I suggest you write it to the devlist and propose it as a new feature there; I'd love to hear what others think about it. 


[GitHub] [airflow] zacharya19 commented on issue #13364: Add virtual env creation in Celery workers

Posted by GitBox <gi...@apache.org>.
zacharya19 commented on issue #13364:
URL: https://github.com/apache/airflow/issues/13364#issuecomment-756675527


   Still haven't got the access, so I'm just going to write down another suggestion I had in mind -
   Instead of making the venv part of the DAG, we can treat this feature similarly to connections, which means the user will be able to create a venv object in the UI (/API) that contains the name and a list of requirements.
   Afterwards, when a task starts running, it will check whether the task has a "venv_name" param configured and run the same flow I suggested above (flock, create the venv if it doesn't exist, install the deps, and run `airflow run` inside the venv).
   
   That way, we gain a much more bullet-proof solution while still giving the user the option to control it inside Airflow, without the need to create a worker per venv.


[GitHub] [airflow] potiuk edited a comment on issue #13364: Add virtual env creation in Celery workers

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #13364:
URL: https://github.com/apache/airflow/issues/13364#issuecomment-753344621


   Just to explain - I have my opinion (voiced above), and I think the non-deterministic behaviour will lead to many user questions and debugging sessions that will be hard to answer. Personally I think we need a more bullet-proof solution.
   
   This is a new feature to add, and it should be discussed in the devlist (like anything else of this caliber). 
   
   We are really community-driven and no single person can make the decision. For me this sounds like an interesting feature; the level of "bulletproofness" it needs should be discussed in the devlist, and I would love to hear what others say.
   
   Whatever the community decides, I am happy with.


[GitHub] [airflow] zacharya19 edited a comment on issue #13364: Add virtual env creation in Celery workers

Posted by GitBox <gi...@apache.org>.
zacharya19 edited a comment on issue #13364:
URL: https://github.com/apache/airflow/issues/13364#issuecomment-753322802


   > Can you please describe in detail the algorithm you want to propose for upgrading/managing the venv?
   
   As a first step, I suggest a simple solution:
   When the Celery worker gets a new task to run, it will extract the venv config from `executor_config` (`venv_name: str` and `packages: List[str]`). If present, it will first take a simple flock on a file (`/tmp/{venv_name}`), create the virtualenv if it doesn't exist (including system-site-packages / copying another venv, as discussed), then run `pip install --upgrade` for all the packages (always, which will slow things down a bit), and release the lock.
   After that - activate the venv and follow the same process as usual (`airflow run ...`).
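   The flow described above could be sketched roughly like this (a minimal illustration of the idea, not Airflow code; `ensure_venv`, the paths, and the `with_pip` switch are all hypothetical):

```python
import fcntl
import os
import subprocess
import venv


def ensure_venv(venv_name, packages, base_dir="/tmp/venvs", with_pip=True):
    """Create (if needed) and update a named venv under an exclusive flock,
    then return the path of its Python interpreter."""
    venv_path = os.path.join(base_dir, venv_name)
    with open(f"/tmp/{venv_name}.lock", "w") as lock_file:
        # Serialize setup so parallel tasks on the same worker don't race.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if not os.path.isdir(venv_path):
                # system_site_packages lets the venv see the worker's
                # already-installed Airflow distribution.
                venv.EnvBuilder(system_site_packages=True,
                                with_pip=with_pip).create(venv_path)
            if packages:
                pip = os.path.join(venv_path, "bin", "pip")
                subprocess.run([pip, "install", "--upgrade", *packages],
                               check=True)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return os.path.join(venv_path, "bin", "python")


# The worker would then launch the task with that interpreter, e.g.:
# subprocess.run([ensure_venv("test", ["internal>=1,<2"]), "-m", "airflow", "run", ...])
```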
   
   This solution is fairly simple to implement and covers the base issues I described in the issue.
   As for other tasks overriding versions and messing things up - I think that, as a first step for this feature, it's the user's problem (just as Airflow doesn't validate that all workers run the same Airflow version). We could maybe provide a warning to let them know; the user can make sure to either pin the same version in all of the tasks (maybe by using Airflow variables) or always use the latest.
   I understand that this puts a lot on the user to make it work, but it's an easy solution without hard work.
   
   > so it's likely better to solve them as 'general' case.
   
   I'm all for starting to think about a generic solution, but I don't think this feature has to be bulletproof, so if you still think my solution isn't good enough I can try to come up with something else, or maybe propose a solution for a generic caching manager.


[GitHub] [airflow] potiuk edited a comment on issue #13364: Add virtual env creation in Celery workers

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #13364:
URL: https://github.com/apache/airflow/issues/13364#issuecomment-752633126


   What we are talking about is caching some resources between task runs (in this case the .venv). The problem with that is that you introduce a non-deterministic approach. Initially, execution time depends on whether the .venv is there or not. But it's more than that. 
   
   What happens if new dependencies are released in the meantime? Do we care about having the exact same versions of the dependencies between different .venvs? What happens if you have two versions of the same .venv on different workers and your task executes either on the older or the newer version (or is executed on one, fails, and then gets retried on the other)? 
   
   As a result - should we always require pinning the versions? Should we invalidate the cached .venv if the "pinned" version changed? How do we determine when we should rebuild the .venv? 
   
   I am not saying it is impossible to solve, I am just saying that it needs to be thought about rather carefully. More than that - maybe we do not have to think that hard and can apply some best practices that are already out there. This is actually very similar to what https://github.com/actions/cache (and any other CI) does. 
   
   And maybe this is something we should do as well - introduce a generic caching approach that we will be able to apply to .venvs or any other "cacheable" resource. We've been discussing other optimisations and the concept of "task affinity". We've already discussed several times that there should be a way of sharing some "cached" state between different tasks of the same kind and reusing that state - for example, machine learning models for workers that have GPUs, where the same model loaded by one task could be reused by many other tasks (saving the model-loading time, which is often lengthy because CPU and GPU memory is not shared - it's only shared in the new M1 from Apple). 
   
   I am not saying we should do it all together, but I see that caching/affinity share a lot of common problems (and solutions), and maybe we should think broader and propose a general caching/affinity solution that could be applied to other cases.
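   One way to frame the invalidation question, borrowed from how actions/cache keys work (an illustrative sketch, not a proposed API; `venv_cache_key` is hypothetical): derive the cache key from the pinned requirement set, so any change to the pins yields a different key and forces a rebuild.

```python
import hashlib


def venv_cache_key(venv_name, requirements):
    """Illustrative cache key: identical pin sets map to the same venv;
    any change to the pins invalidates it (forcing a rebuild)."""
    # Sort so the key does not depend on the order the pins are listed in.
    digest = hashlib.sha256("\n".join(sorted(requirements)).encode()).hexdigest()
    return f"{venv_name}-{digest[:12]}"
```

   For example, `venv_cache_key("test", ["internal>=1,<2"])` stays stable until the pin itself changes, at which point a fresh venv would be built under the new key.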


[GitHub] [airflow] potiuk edited a comment on issue #13364: Add virtual env creation in Celery workers

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #13364:
URL: https://github.com/apache/airflow/issues/13364#issuecomment-752947051


   > To be honest, I think the best approach here is to not solve it and let the user know about this behavior, and I don't think this is such a bad behavior, as if a task has started already and the there is now a new version - you wouldn't stop the task to rerun, it make sense it will run and the next task will use the new version.
   
   @zacharya19  Can you please describe in detail the algorithm you want to propose for upgrading/managing the venv? The one proposed does not describe the details. When should it happen? Are you going to try to upgrade to newer dependencies every time a new task is run? Which component of the process does it? How does it play out with other tasks running at the same time and possibly using the same venv at the time of upgrade? I think you heavily underestimate the complexity involved in the upgrade/creation of the venvs. Remember that this is not only a problem of different venvs running on different workers - LocalExecutor and CeleryExecutor will run multiple tasks in parallel on the same machine. Should they share the same venv? What happens if different tasks have different "pinned" versions specified, for example? Will they override each other's venv? Or when one task has a pinned version and the other does not - what happens then?
   
   I think this requires at least a design doc describing the whole algorithm (and likely it already requires creating and discussing an Airflow Improvement Proposal (https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)) and collective discussion on all such cases in the devlist.
   
   > I would love to see some caching mechanism in Airflow for other things as well (shared Java JARs for example), but a generic cache solution IMHO will only complex this feature, as right now it's out of the box (create venv if the folder doesn't exists, run pip install, and that's it, something like: `Operator(..., executor_config={"CeleryExecutor": {"venv_name": "test", "packages": ["internal=>1,<2"])`).
   
   This will not work in a number of cases described above. I think when you look at  all the scenarios and edge-cases involved, you will find out that you have the same complexity as any other caching solution and we will end up having to solve them, so it's likely better to solve them as 'general' case.
   
   Remember the quote?
   
   > There are only two hard things in Computer Science: cache invalidation and naming things.
   > -- Phil Karlton
   
   


[GitHub] [airflow] boring-cyborg[bot] commented on issue #13364: Add virtual env creation in Celery workers

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #13364:
URL: https://github.com/apache/airflow/issues/13364#issuecomment-752071023


   Thanks for opening your first issue here! Be sure to follow the issue template!
   


[GitHub] [airflow] zacharya19 commented on issue #13364: Add virtual env creation in Celery workers

Posted by GitBox <gi...@apache.org>.
zacharya19 commented on issue #13364:
URL: https://github.com/apache/airflow/issues/13364#issuecomment-756750850


   @potiuk Didn't get the response email, weird, but anyway I've sent my Wiki ID now.


[GitHub] [airflow] zacharya19 edited a comment on issue #13364: Add virtual env creation in Celery workers

Posted by GitBox <gi...@apache.org>.
zacharya19 edited a comment on issue #13364:
URL: https://github.com/apache/airflow/issues/13364#issuecomment-756825749


   > You can get the 'how to subscribe' info here: https://airflow.apache.org/community/
   
   Thanks, subscribed and created the AIP - https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-37+Virtualenv+management+inside+Airflow.
   
   Also, sent an email to start a discussion on my solution.
   


[GitHub] [airflow] potiuk edited a comment on issue #13364: Add virtual env creation in Celery workers

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #13364:
URL: https://github.com/apache/airflow/issues/13364#issuecomment-752257146


   I think this is a bit of a bigger task to tackle, but an interesting candidate to discuss. I remember a discussion at the devlist where we discussed different options for how to do it (preparing wheel packages for tasks was one of them, for example).
   
   However, this idea is much simpler and can be implemented much more easily, and I think it might serve the purpose you described pretty well. It does not involve a lot of changes. It would be rather similar to what cgroup_task_runner already does: 
   https://github.com/apache/airflow/blob/master/airflow/task/task_runner/cgroup_task_runner.py
   
   Conceptually it could be a "virtualenv_task_runner", but it would launch the process in another virtualenv (and create the venv first if it does not exist) - very similar to what cgroup_task_runner does when creating a new cgroup process. 
   
   I think there is possibly non-trivial work, though, with the dependency management of Airflow itself when crossed with the other requirements you want to add, because what you are really looking at is not only creating the virtualenv itself and adding your packages there - you also have to make sure that Airflow is installed in this virtualenv, including some (possibly all?) packages that come with Airflow. This is the only way to make sure that all operators will work within such an environment.
   
   But if we could assume that we copy "all" standard Airflow dependencies and add only the "extra" ones that might differ between venvs, that might work rather well and could be implemented easily. I think a Python venv will reference the packages already installed in the "main" environment by default (rather than copying them), so this might even be pretty efficient in terms of space used.
   
   I wonder what others think about it @kaxil @ashb @mik-laj @turbaszek - I know you might know of some potential problems with this approach, but at first look it seems rather reasonable and rather easy to implement? 
   


[GitHub] [airflow] potiuk commented on issue #13364: Add virtual env creation in Celery workers

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #13364:
URL: https://github.com/apache/airflow/issues/13364#issuecomment-756748744


   > Still haven't got the access 
   
   I believe someone asked you for your ID to give you access (in response to your mail) - did you send it?
   





[GitHub] [airflow] ashb edited a comment on issue #13364: Add virtual env creation in Celery workers

Posted by GitBox <gi...@apache.org>.
ashb edited a comment on issue #13364:
URL: https://github.com/apache/airflow/issues/13364#issuecomment-752360736


   My first "off the cuff" thought is that this would be something that lives outside Airflow, and it's just the context you run the celery workers in.
   
   The venv task runner idea does sound like a good way of tackling this if we do want to include it - but it would be slow. On reading the issue subject my first thought was:
   
   - create a venv before hand
   - run `airflow celery worker -q myvenv`
   - use the `queue` task attribute to run the task in the right place.
   
   It depends on how many venvs you want, I guess.
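   That routing can be expressed as a DAG fragment like the one below (assuming a worker was started with `airflow celery worker -q myvenv` from inside the pre-built venv; the DAG id, callable, and `internal_package` name are hypothetical, and the import paths are for Airflow 2.0):

```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago


def my_callable():
    # This import is resolvable only inside the "myvenv" environment.
    import internal_package
    internal_package.do_work()


with DAG("venv_routing_example",
         start_date=days_ago(1),
         schedule_interval=None) as dag:
    # The queue attribute sends this task only to workers listening on
    # the "myvenv" queue, i.e. the ones running inside that venv.
    task = PythonOperator(
        task_id="run_in_myvenv",
        python_callable=my_callable,
        queue="myvenv",
    )
```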





[GitHub] [airflow] zacharya19 commented on issue #13364: Add virtual env creation in Celery workers

Posted by GitBox <gi...@apache.org>.
zacharya19 commented on issue #13364:
URL: https://github.com/apache/airflow/issues/13364#issuecomment-753460082


   > This seems to be more reliable approach (managing deps on deployment level) than managing dependencies via application (Airflow) which main task is scheduling and executing tasks.
   
   As I said, the issue here is that the user has to create some kind of complex deployment script, because running airflow inside a venv requires a restart when updating the packages.
   Anyway, I had a few extra ideas to make this feature more reliable; I sent a request to get edit access to the wiki and will write a more detailed proposal when I do.





[GitHub] [airflow] turbaszek commented on issue #13364: Add virtual env creation in Celery workers

Posted by GitBox <gi...@apache.org>.
turbaszek commented on issue #13364:
URL: https://github.com/apache/airflow/issues/13364#issuecomment-753446898


   > Once getting to a point when you want to create cluster for different types of python tasks and you've multiple teams working on the same cluster, you need to start splitting into different python packages the business logic code to allow better versioning control and avoid the need of restarting the workers when deploying new util code.
   
   In my opinion, if you start to have multiple teams working on the same Airflow deployment, then virtualenv solves the problem only at the Python level, which is not sufficient. I personally would be in favour of a worker/queue per env/team (as @ashb suggested). This seems to be a more reliable approach (managing deps at the deployment level) than managing dependencies via the application (Airflow), whose main task is scheduling and executing tasks.
   
   But as @potiuk said - we should move this discussion to the devlist to gather wider feedback.





[GitHub] [airflow] zacharya19 commented on issue #13364: Add virtual env creation in Celery workers

Posted by GitBox <gi...@apache.org>.
zacharya19 commented on issue #13364:
URL: https://github.com/apache/airflow/issues/13364#issuecomment-756825749


   > You can get the 'how to subscribe' info here: https://airflow.apache.org/community/
   
   Thanks, subscribed and created the AIP - https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-37+Virtualenv+management+inside+Airflow.
   





[GitHub] [airflow] zacharya19 commented on issue #13364: Add virtual env creation in Celery workers

Posted by GitBox <gi...@apache.org>.
zacharya19 commented on issue #13364:
URL: https://github.com/apache/airflow/issues/13364#issuecomment-752578580


   @potiuk 
   > You need to have at least airflow package + a number of dependent packages
   
   You're right; that's why I'm suggesting we allow the user to choose between two methods:
   1. System site packages symlink.
   2. Virtualenv copy.
   
   I don't want this feature to make Airflow launchable only via a venv or only via the global env.
   
   @ashb 
   
   > but it would be slow.
   
   Well, yes and no.
   I don't suggest creating a new venv every time I want to run a task. I think creating reusable venvs is a good approach: the worker will only make sure the right versions are installed, so it will be slow only the first time you run the task.
   The user will be able to control which tasks will share a venv.
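   One cheap way to implement that sharing (a sketch; the function name `venv_key` is made up) is to key the venv directory off the requirement set itself, so any two tasks declaring the same pinned packages resolve to the same cached venv:

```python
import hashlib


def venv_key(packages):
    """Derive a stable venv directory name from the requirement strings.

    Sorting makes the key order-independent, so tasks that declare the
    same pinned set in any order will share one cached venv, while any
    change to the set produces a fresh venv."""
    canonical = "\n".join(sorted(packages))
    return "venv-" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

   The worker would then only run `pip install` inside an existing venv when the key is new, making the slow path a one-time cost per requirement set.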
   
   > The venv task runner idea does sound like a good way of tackling this of we do want to include it - but it would be slow. On reading the issue subject my first thought was:
   > 
   >     * create a venv before hand
   > 
   >     * run `airflow celery worker -q myvenv`
   > 
   >     * use the `queue` task attribute to run the task in the right place.
   
   There is a major issue with that approach: it makes upgrading versions harder. It requires the user to build some kind of deployment script that shuts the worker down gracefully and launches a new one every time they want to upgrade any package.
   With my approach, where the worker launches a new process inside the venv, the user can control the package versions easily without needing any restart.





[GitHub] [airflow] potiuk commented on issue #13364: Add virtual env creation in Celery workers

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #13364:
URL: https://github.com/apache/airflow/issues/13364#issuecomment-752271420


   > Yeah, as long as we use `system-site-packages` option it can work, but it will force the user to work with airflow as a system site package and not virtual env if they want, so that could be problematic.
   
   The problem is that without 'airflow' in the venv, NONE of the operators will work. You need to have at least the airflow package + a number of dependent packages, otherwise airflow's in-task functionalities like jinja templating, variables, xcom and a number of others will not work. Also, if you would like to use any of the "provider" operators, they won't work either without installing the provider (and dependent) packages. That's what I meant by the non-trivial work if you want to limit that and only install 'airflow and a few more packages' - determining which packages to install in such a venv is far from trivial and error-prone. So I think the only viable solution is to clone the current "airflow" installation, and then you can have different venvs for different extra packages.
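   For completeness, cloning the current installation could start from something like `pip freeze` (a sketch, not a full solution - it ignores editable installs and files not managed by pip):

```python
import subprocess
import sys


def freeze_current_env():
    """List the distributions installed in the interpreter running the
    worker, as pip requirement strings (e.g. 'apache-airflow==2.0.0').

    A new venv seeded from this list is effectively a clone of the
    worker environment, on top of which the task's "extra" packages
    can then be layered."""
    out = subprocess.check_output([sys.executable, "-m", "pip", "freeze"])
    return [line for line in out.decode().splitlines() if line.strip()]
```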
   
   Otherwise (if you do not want to install airflow + dependent packages) you should simply use the PythonVirtualenvOperator, as you mentioned above.
   





[GitHub] [airflow] potiuk commented on issue #13364: Add virtual env creation in Celery workers

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #13364:
URL: https://github.com/apache/airflow/issues/13364#issuecomment-753344621


   Just to comment - I have my opinion (voiced above), and I think the nondeterministic behaviour will lead to many user questions that are hard to answer and to difficult debugging. Personally I think we need a more bullet-proof solution.
   
   This is a new feature to add, and it should be discussed on the devlist (like anything else of this caliber).
   
   We are really community-driven and no single person can make a decision. To me this sounds like an interesting feature; the level of "bulletproofness" it needs should be discussed on the devlist, and I would love to hear what others say.
   
   Whatever the community decides, I am happy with.





[GitHub] [airflow] potiuk commented on issue #13364: Add virtual env creation in Celery workers

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #13364:
URL: https://github.com/apache/airflow/issues/13364#issuecomment-756760563


   > @potiuk Didn't get the response email, weird, but anyway sent my Wiki ID now.
   
   @zacharya19 - maybe you are not subscribed to the devlist? By default, even when you 'reply to all' at the devlist, the reply-to is the devlist only. So if you are not subscribed, you might not get the answer:
   
   You can get the 'how to subscribe' info here: https://airflow.apache.org/community/
   
   








[GitHub] [airflow] zacharya19 commented on issue #13364: Add virtual env creation in Celery workers

Posted by GitBox <gi...@apache.org>.
zacharya19 commented on issue #13364:
URL: https://github.com/apache/airflow/issues/13364#issuecomment-752264982


   >  But maybe if we could assume that we copy "all" standard airflow dependencies and add only the "extra" ones might be different between different venvs, that might work rather well and can be implemented easily. I think python venv will install symlinks to the packages by default when they are already installed in the "main" environment so this might even be pretty effficient in terms of space used.
   
   Yeah, as long as we use the `system-site-packages` option it can work, but it forces the user to install airflow as a system site package rather than inside a virtualenv if they prefer one, so that could be problematic.
   Another option would be to let the user control it by specifying the airflow virtualenv (or it could be detected at runtime) and use some virtualenv tools to duplicate it (such as virtualenvwrapper).





[GitHub] [airflow] zacharya19 commented on issue #13364: Add virtual env creation in Celery workers

Posted by GitBox <gi...@apache.org>.
zacharya19 commented on issue #13364:
URL: https://github.com/apache/airflow/issues/13364#issuecomment-752762144


   > What happens if new dependencies are released in the meantime? Do we care about having the exact same version of the dependencies between different .venvs? What happens if you have two versions of the same .venv in different workers and your task executes either on the older or newer version (or is executed on one, fails and then gets retried on the other)?
   
   To be honest, I think the best approach here is to not solve it and to let the user know about this behavior. I don't think this is such bad behavior: if a task has already started and there is now a new version, you wouldn't stop the task to rerun it; it makes sense that it keeps running and that the next task uses the new version.
   If the user desires deterministic behavior here, I think it's reasonable to ask them to pin all their versions, and that will be something Airflow does not control (/force).
   
   > I am not talking we should do it all together, but I see that caching/affinity share a lot of common problems (and solution) and maybe we should think broader and propose a general caching/affinity solution that could be applied to other cases.
   
   I would love to see some caching mechanism in Airflow for other things as well (shared Java JARs, for example), but a generic cache solution IMHO will only complicate this feature, as right now it works out of the box (create the venv if the folder doesn't exist, run pip install, and that's it - something like: `Operator(..., executor_config={"CeleryExecutor": {"venv_name": "test", "packages": ["internal>=1,<2"]}})`).
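   The worker-side handling of that proposed `executor_config` shape could be as small as the following (a sketch using the names proposed above, not an existing Airflow API):

```python
def parse_venv_config(executor_config):
    """Extract the proposed venv settings from a task's executor_config.

    Returns (venv_name, packages) when the task requests a dedicated
    venv, or None when it should run in the worker's own environment."""
    celery_cfg = (executor_config or {}).get("CeleryExecutor", {})
    if "venv_name" not in celery_cfg:
        return None
    return celery_cfg["venv_name"], list(celery_cfg.get("packages", []))
```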





[GitHub] [airflow] zacharya19 commented on issue #13364: Add virtual env creation in Celery workers

Posted by GitBox <gi...@apache.org>.
zacharya19 commented on issue #13364:
URL: https://github.com/apache/airflow/issues/13364#issuecomment-756675527


   Still haven't got the access





[GitHub] [airflow] potiuk edited a comment on issue #13364: Add virtual env creation in Celery workers

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #13364:
URL: https://github.com/apache/airflow/issues/13364#issuecomment-752947051


   > To be honest, I think the best approach here is to not solve it and let the user know about this behavior, and I don't think this is such a bad behavior, as if a task has started already and there is now a new version - you wouldn't stop the task to rerun, it makes sense it will run and the next task will use the new version.
   
   @zacharya19 Can you please describe in detail the algorithm you want to propose for upgrading/managing the venv? The one proposed does not describe the details. When should it happen? Are you going to try to upgrade to newer dependencies every time a new task runs? Which component of the process does it? How does it play out with other tasks running at the same time and possibly using the same venv at the time of the upgrade? I think you heavily underestimate the complexity involved in the upgrade/creation of the venvs. Remember that this is not only running on different workers: LocalExecutor and CeleryExecutor will run multiple tasks in parallel on the same machine. Should they share the same venv? What happens if different tasks have different "pinned" versions specified, for example? Will they override each other's venv? Or when one task has a pinned version and the other does not - what happens then?
   
   I think this requires at least a design doc describing the whole algorithm (and likely it already requires creating and discussing an Airflow Improvement Proposal: https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals) and a collective discussion of all such cases on the devlist.
   
   > I would love to see some caching mechanism in Airflow for other things as well (shared Java JARs for example), but a generic cache solution IMHO will only complex this feature, as right now it's out of the box (create venv if the folder doesn't exists, run pip install, and that's it, something like: `Operator(..., executor_config={"CeleryExecutor": {"venv_name": "test", "packages": ["internal=>1,<2"])`).
   
   This will not work in a number of the cases described above. I think when you look at all the scenarios and edge cases involved, you will find that you have the same complexity as any other caching solution and we will end up having to solve them anyway, so it's likely better to solve them as a 'general' case.
   
   Remember the quote?
   
   > There are only two hard things in Computer Science: cache invalidation and naming things.
   > -- Phil Karlton
   
   





[GitHub] [airflow] zacharya19 edited a comment on issue #13364: Add virtual env creation in Celery workers

Posted by GitBox <gi...@apache.org>.
zacharya19 edited a comment on issue #13364:
URL: https://github.com/apache/airflow/issues/13364#issuecomment-756825749


   > You can get the 'how to subscribe' info here: https://airflow.apache.org/community/
   
   Thanks, subscribed and created the AIP - https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-37+Virtualenv+management+inside+Airflow.
   
   Also, sent an email to start a discussion on my solution.
   

