Posted to dev@airflow.apache.org by Ping Zhang <pi...@umich.edu> on 2021/12/16 23:52:02 UTC

[DISCUSS] Docker runtime isolation for airflow tasks

Hi Airflow Community,

This is Ping Zhang from the Airbnb Airflow team. We would like to open-source
our internal feature: docker runtime isolation for airflow tasks. It has been
in production at Airbnb for close to a year and is very stable.

I will create an AIP after the discussion.

Thanks,

Ping


Motivation

An Airflow worker host is a shared resource among all tasks running on it.
Each host therefore has to provision the dependencies of every task, including
system-level and Python application-level dependencies. This leads to a very
fat runtime, and in turn to long host provision times and low elasticity in
the worker resources. It makes it challenging to prepare for unexpected burst
load, such as a large backfill or a rerun of large DAGs.

The lack of runtime isolation also makes operations challenging and risky:
adding or upgrading system and Python dependencies is delicate, and removing
any dependency is almost impossible. It incurs a lot of additional operating
cost for the team, because users do not have permission to add or upgrade
Python dependencies themselves and every change has to be coordinated with us.
When package versions conflict, the packages cannot be installed directly on
the host, so users have to fall back on PythonVirtualenvOperator, which slows
down their development cycle.

What change do you propose to make?

To solve these problems, we propose introducing runtime isolation for Airflow
tasks, leveraging docker as the task runtime environment. There are several
benefits:

   1. Provide runtime isolation at the task level
   2. Customize the runtime used to parse dag files
   3. Keep a lean runtime on the airflow host, which enables high worker
      resource elasticity
   4. Provide an immutable and portable task execution runtime
   5. Ensure process isolation, so that all subprocesses of a task are cleaned
      up after docker exits (we have seen orphaned hive and spark subprocesses
      left behind after the airflow run process exits)

Changes

Airflow Worker

In the new design, the `airflow run local` and `airflow run raw` processes
run inside a docker container that is launched by the airflow worker. In this
way, the airflow worker runtime only needs the minimum requirements to run
airflow core and docker.
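
A minimal sketch of what that worker-side wrapping could look like (the mount
layout, image name and exact CLI flags below are illustrative assumptions, not
the actual Airbnb implementation):

import subprocess


def run_task_in_container(image, dag_id, task_id, execution_date,
                          dags_dir="/usr/local/airflow/dags",
                          logs_dir="/usr/local/airflow/logs"):
    """Launch the `airflow run local` process inside a container instead of on the host.

    Illustrative only: paths, image and flags are assumptions for this sketch.
    """
    cmd = [
        "docker", "run", "--rm",
        # Share dags and logs with the host so the webserver can still serve task logs.
        "-v", f"{dags_dir}:{dags_dir}:ro",
        "-v", f"{logs_dir}:{logs_dir}",
        image,
        "airflow", "run", dag_id, task_id, execution_date, "--local",
    ]
    return subprocess.call(cmd)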
Airflow Scheduler

Instead of processing the DAG file directly, the DagFileProcessor process

   1. launches the docker container required by that DAG file to process it,
      and persists the serializable DAGs (SimpleDags) to a file so that the
      result can be read outside the docker container
   2. reads the persisted file back outside the docker container, deserializes
      it and puts the result into the multiprocess queue


This ensures the DAG parsing runtime is exactly the same as the DAG execution
runtime.
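
A rough sketch of those two steps, assuming a hypothetical shared result
directory and a hypothetical in-container parse helper (neither name comes
from the proposal):

import pickle
import subprocess


def process_dag_file_in_container(image, dag_file, result_dir="/tmp/dag_parse_results"):
    """Parse one DAG file inside its declared docker runtime and return the result.

    Step 1 parses inside the container and persists the serializable DAGs
    (SimpleDags) to a file on a shared volume; step 2 reads that file back on
    the host. The result path, the pickle format and parse_dag_to_file.py are
    illustrative assumptions.
    """
    result_file = f"{result_dir}/{dag_file.replace('/', '_')}.pickle"

    # Step 1: parse inside the container and persist SimpleDags to the shared volume.
    subprocess.check_call([
        "docker", "run", "--rm",
        "-v", f"{result_dir}:{result_dir}",
        "-v", f"{dag_file}:{dag_file}:ro",
        image,
        "python", "/opt/airflow/parse_dag_to_file.py", dag_file, result_file,
    ])

    # Step 2: read the persisted result outside the container and deserialize it.
    with open(result_file, "rb") as f:
        simple_dags = pickle.load(f)
    return simple_dags  # the caller puts this onto the DagFileProcessor multiprocess queue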

This requires a way for a DAG definition file to tell the DAG file processing
loop which docker image to use to process it. We can easily achieve this by
placing a metadata file alongside the DAG definition file to define the docker
runtime. To ease the burden on users, a default docker image is used when a
DAG definition file does not require a customized runtime.
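
One possible shape for that lookup, assuming a hypothetical
<dag_file>.runtime.yaml sidecar (the file name, keys and default image are
placeholders, not part of the proposal):

import os

import yaml  # PyYAML

DEFAULT_IMAGE = "airflow-task-runtime:default"  # assumed default image name


def resolve_runtime_image(dag_file_path):
    """Return the docker image declared next to a DAG file, or the default image.

    Assumes a sidecar file such as:

        # my_dag.py.runtime.yaml
        image: my-team/airflow-runtime:py39-spark3
    """
    sidecar = dag_file_path + ".runtime.yaml"
    if not os.path.exists(sidecar):
        return DEFAULT_IMAGE
    with open(sidecar) as f:
        meta = yaml.safe_load(f) or {}
    return meta.get("image", DEFAULT_IMAGE)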
As a Whole

[Overall architecture diagram attached; not visible here because the dev list
strips attachments.]


Best wishes

Ping Zhang

Re: [DISCUSS] Docker runtime isolation for airflow tasks

Posted by Ping Zhang <pi...@umich.edu>.
Hi Alexander,

Thanks for the input. The docker runtime is an add-on feature controlled by a
feature flag; it does not force users to use docker to run tasks.
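
For example, the gate could be a single boolean in airflow.cfg; the section
and key names below are assumptions for illustration, not the actual flag:

from airflow.configuration import conf

# Hypothetical flag; when it is off, tasks run directly on the worker host as today.
DOCKER_RUNTIME_ENABLED = conf.getboolean("core", "docker_runtime_isolation", fallback=False)


def build_task_command(base_cmd, image):
    """Wrap the task command in `docker run` only when the feature flag is enabled."""
    if not DOCKER_RUNTIME_ENABLED:
        return list(base_cmd)
    return ["docker", "run", "--rm", image] + list(base_cmd)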


Best wishes

Ping Zhang



Re: [DISCUSS] Docker runtime isolation for airflow tasks

Posted by Ping Zhang <pi...@umich.edu>.
Hi Jarek,

Thanks for the input. Yep, the docker runtime is an add-on feature controlled
by a feature flag, and definitely *not* the default way to run tasks.

I agree that this should be the next AIP. It actually does not affect how
the AIP-43 proposal is designed and written.

Best wishes

Ping Zhang



Re: [DISCUSS] Docker runtime isolation for airflow tasks

Posted by Jarek Potiuk <ja...@potiuk.com>.
Yeah. I think "Docker" as a "common execution environment" is certainly
convenient in some situations. But it should definitely not be the default, as
mentioned before: as much as I love containers, I also know - from the surveys
we run, but also from interacting with many users of Airflow - that for many
of our users containers are not the "default" way of doing things, and we
should embrace that.

I see that some multi-tenancy deployments in the future could benefit from
having different sets of dependencies - both for parsing and execution (say,
one environment per team). I think the current AIP-43 proposal already handles
a big part of it. You could have - dockerised or not - different environments
to parse your dags in, one per "subfolder" - so each DagProcessor for a
sub-folder could have a different set of dependencies (coming either from a
virtualenv or from Docker). All this without putting a "requirement" on using
Docker. I think we are well aligned on the goal. I see it as the choice
between:

a) whether Airflow should be able to choose the run "environment" in which to
parse a Dag based on meta-data (which is what your proposal is about)
b) whether each DagProcessor (a different one per team) should be started in
the right "environment" to begin with (which is actually made possible by
AIP-43) - which is part of the Deployment, not Airflow code.

I think we should go with b) first, and if b) turns out not to be enough, a
future AIP to implement a) is also possible.

When it comes to task execution - this is something we can definitely discuss
in the future, as the next AIP. We should think about how to seamlessly map
the execution of a task onto different environments. We can definitely make it
a point of discussion in the Jan meeting I plan to have.

As Ash mentioned, we already have some ways to do it (Celery queues, the K8s
executor, but also the Airflow 2 @task.docker decorator). I am sure we can
discuss those, see how we can address it after AIP-43/44 are
discussed/approved, and see if it makes sense to add another way.
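
For reference, a minimal sketch of the existing @task.docker route (it needs
the docker provider installed; the image and values here are just
placeholders):

from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule_interval=None, start_date=datetime(2021, 12, 1), catchup=False)
def docker_isolated_example():
    @task.docker(image="python:3.9-slim")
    def double(x):
        # Runs inside the image above, isolated from the worker host environment.
        return x * 2

    double(21)


example_dag = docker_isolated_example()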

J.



Re: [DISCUSS] Docker runtime isolation for airflow tasks

Posted by Alexander Shorin <kx...@gmail.com>.
How would your idea work on systems without docker, like FreeBSD? And why did
you make tasks so leaky that they couldn't be isolated with common tools like
system packages, venv, etc.?

--
,,,^..^,,,



Re: [DISCUSS] Docker runtime isolation for airflow tasks

Posted by Ping Zhang <pi...@umich.edu>.
Hi Ash,

Thanks for the input. I should have specifically called out that the docker
runtime is an add-on feature controlled by a feature flag.

Users/the infra team can choose whether to enable it. When it is not enabled,
the current behavior stays in place.

This docker runtime feature helped a lot during our py3 upgrade project: we
just built a py3 docker image to run tasks and parse dags, without needing to
spin up a new airflow cluster.

Best wishes

Ping Zhang



Re: [DISCUSS] Docker runtime isolation for airflow tasks

Posted by Ash Berlin-Taylor <as...@apache.org>.
Hi Ping,

(The dev list doesn't allow attachments, so we can't see any of the images
you've posted; some of my questions might already have been addressed by
those images.)

It seems that a lot of the goals here overlap with AIP-1 and the proposed
separation of the dag processor from the scheduler, and with the
multi-tenancy work in general.
Your description of how the scheduler and the DAG parsing process operate is
based on the 1.10 mode of operation, but that has changed in 2.0 -- the
scheduler _only_ operates on the serialized representation and doesn't need
the result of the dag parsing process. Breaking this tight coupling was one
of the major speed-ups I achieved.

The exact details aren't clear from your email yet, but here are my initial
comments:

1. Runtime isolation of task execution is already possible by using the
KubernetesExecutor (see the sketch after this list)

2. Running short-lived processes (such as what I think you are proposing
for dag parsing) in a Kube cluster isn't really practical, as the spin-up
time of pods is highly variable and can be on the order of minutes

3. Not everyone has docker available or is comfortable running it -- we
100% need to support running without Docker or containers still.

4. Many of our users are Data Scientists or Engineers, and so aren't happy
with building containers.
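
For point 1, a minimal sketch of how a task can already request its own image
when running on the KubernetesExecutor, via executor_config (the image name is
a placeholder):

from datetime import datetime

from kubernetes.client import models as k8s

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    "k8s_isolated_example",
    start_date=datetime(2021, 12, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    BashOperator(
        task_id="run_in_custom_image",
        bash_command="python --version",
        # With the KubernetesExecutor, this pod override swaps the image the
        # task runs in, giving per-task runtime isolation.
        executor_config={
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[
                        k8s.V1Container(
                            name="base",
                            image="my-registry/airflow:py39-custom",
                        ),
                    ],
                ),
            ),
        },
    )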
