Posted to users@airflow.apache.org by Chris Redekop <ch...@replicon.com> on 2022/01/14 17:18:13 UTC

How to sandbox the tasks from each other?

Hi all!...brand new user here. I'm evaluating Airflow (AWS MWAA) as a
solution to provide integration/workflow services for multiple
tenants, and one risk area I'm trying to mitigate is '/tmp' local disk
storage. Tasks for multiple tenants will commonly be writing data to the
/tmp dir, and it would be a *really* big deal if they somehow conflicted
and one tenant's data was exposed to another tenant. These workflows will
be written by a team of, shall we say, not super-strong developers, so
conflicts are likely to happen eventually, and proper tempfile cleanup is
likely to be missed. Are there any common strategies to deal with this
risk?
I was thinking of installing a task instance mutation cluster policy hook
which would "rm -rf /tmp/*" before every task run, but of course that's
sketchy and would very possibly break other tasks running concurrently
on the same worker.
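
A safer variant might be to sweep only stale files out of a dedicated
scratch root before each task starts, leaving concurrently running tasks
alone. Here's a rough sketch of what I mean (the scratch path and the use
of on_execute_callback are my own assumptions, untested on MWAA):

    import os
    import time

    # Assumed convention: all tasks write scratch files under this root,
    # and every operator is given
    # default_args={"on_execute_callback": sweep_stale_scratch}.
    SCRATCH_ROOT = "/tmp/airflow-scratch"
    MAX_AGE_SECONDS = 24 * 60 * 60  # anything older than a day is fair game

    def sweep_stale_scratch(context):
        cutoff = time.time() - MAX_AGE_SECONDS
        for dirpath, _dirnames, filenames in os.walk(SCRATCH_ROOT):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    if os.path.getmtime(path) < cutoff:
                        os.remove(path)
                except OSError:
                    pass  # a concurrent task may have removed it already
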
To automate cleanup (so worker disks don't eventually fill up), the only
other thing I can think of is to provide a library with a get_temp_file()
method which generates timestamped temp file names, so we can automatically
delete files more than a day old (or whatever)....but of course this relies
on the team diligently using our library method rather than the standard
Python one, or (god forbid) hardcoding their own filenames.
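
Something like this is what I have in mind - a rough sketch only; the
helper and the per-tenant layout are our own convention, not an Airflow
API:

    import os
    import tempfile
    import time

    SCRATCH_ROOT = "/tmp/airflow-scratch"  # assumed shared convention

    def get_temp_file(tenant, suffix=""):
        # Each tenant gets its own subdirectory, and every file name is
        # prefixed with a creation timestamp so a scheduled sweep can
        # delete anything older than a day.
        tenant_dir = os.path.join(SCRATCH_ROOT, tenant)
        os.makedirs(tenant_dir, exist_ok=True)
        fd, path = tempfile.mkstemp(
            dir=tenant_dir, prefix="%d-" % int(time.time()), suffix=suffix
        )
        os.close(fd)  # caller reopens the file as needed
        return path
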
Any thoughts or insights are appreciated. Thanks!

Re: How to sandbox the tasks from each other?

Posted by Chris Redekop <ch...@replicon.com>.
Thanks Jarek, this is all very good info to know. Unfortunately I'm doing
my evaluation on AWS MWAA, so changing the executor isn't an option (at
least not yet). It'll probably be a hard sell to convince management to run
a self-managed Airflow cluster to get the heightened isolation of the
Kubernetes Executor, but maybe it's worth floating the idea :)  Thanks for
the info!

Re: How to sandbox the tasks from each other?

Posted by Jarek Potiuk <ja...@potiuk.com>.
> I mean "one team writing DAGs for multiple clients, and those tasks can't collide". We don't require actual security from malicious users, we just need some safety rails to prevent accidents.

I think "/tmp" is only one of the problems. I am not sure if you are
aware but DAG writers have a lot of power in the current airflow. They
could even accidentally - for example - delete the whole metadata db
or all dag history by issuing an ORM command to delete those. There
are no protections (and that's by design until multi-tenancy is
implemented. So worrying about /tmp accidental clashes by
inexperienced users is the least of your worries I believe. Airflow
(currently) assumes a lot of trust in the DAG writers that they are
not doing anything "crazy" (again this is by design assumption is that
DAG writers know what they are doing and their code is reviewed by
their peers before executed).
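
To make that concrete: nothing stops a DAG file from opening an ORM
session against the metadata database. A deliberately scary sketch (do
NOT run this):

    from airflow.models import DagRun
    from airflow.utils.session import create_session

    # A DAG file is arbitrary Python executed with full metadata-DB
    # access, so two careless lines can wipe every team's run history.
    with create_session() as session:
        session.query(DagRun).delete()
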

However, even if you focus only on file access, /tmp is not your only
problem. Depending on which executor you use, there are other
possibilities of "clashing":

1) Local Executor - tasks run as processes on the same machine as the
scheduler, and ANY file (not only /tmp) can be shared/overwritten.
If your teams choose some "/file/file-storage" location, they could also
overwrite those files (there is no way to provide different access
levels to tasks belonging to different teams).

2) Celery Executor - workers are usually separated from the scheduler,
but one worker can still handle multiple tasks from (potentially)
different teams, and the same problems can occur. You can potentially
separate teams by using different queues (each team having a separate
set of workers), but this is not at all "safe": any DAG writer can
override the queue to another value (see the sketch after this list),
so effectively any team member can run DAGs as another team member. No
protection against that (except code review) is currently built in.

3) Kubernetes Executor - here the situation is a bit better. Each task
always runs in a separate, new pod, and the only shared volumes are
those you explicitly add in the pod template (but a user could still,
conceptually, run `DELETE FROM dag` and delete all DAGs from all
teams; no protection against such cases is currently possible, same as
with Local/Celery).
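
The queue override mentioned in 2) really is a one-liner; the queue
name here is hypothetical:

    from airflow.operators.bash import BashOperator

    # Inside any DAG definition, nothing stops the author from pointing
    # a task at another team's Celery queue:
    leaky = BashOperator(
        task_id="runs_on_someone_elses_worker",
        bash_command="ls /tmp",
        queue="other_teams_queue",
    )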

So, in short - there are no "good" protections. If you want to protect
against "accidental" /tmp file overrides between teams, use the
Kubernetes Executor.

What you could also do is set TMPDIR to a different path for each
team, or make your teams use only the DockerOperator or
KubernetesPodOperator to introduce file-level separation (but this
would require some conventions adopted by the teams, and trust that
they are not breaking them - there is nothing in Airflow to enforce
those). You could potentially "check" some of those conventions via
cluster policies:
https://airflow.apache.org/docs/apache-airflow/stable/concepts/cluster-policies.html
- but those checks will only be able to verify that your conventions
are followed; you would not be able to detect whether a member of one
team is pretending to be a member of another team (unless you also add
separate folders and permissions for DAG submissions and tie team
identity to the DAG file's location). This is not fool-proof either,
because any DAG writer could override the location dynamically when
the DAG is parsed.
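
For illustration, a minimal policy sketch - the team/queue naming
convention here is entirely made up, and it only catches convention
violations, not impersonation:

    # airflow_local_settings.py - Airflow picks this up if it is on the
    # Python path; task_policy runs for every task when the DAG is loaded.
    from airflow.exceptions import AirflowClusterPolicyViolation

    def task_policy(task):
        # Assumed convention: dag_id "team__whatever" must stay on
        # queue "team".
        team = task.dag_id.split("__", 1)[0]
        if task.queue != team:
            raise AirflowClusterPolicyViolation(
                "Task %s in DAG %s must use queue %r, not %r"
                % (task.task_id, task.dag_id, team, task.queue)
            )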

J.

Re: How to sandbox the tasks from each other?

Posted by Chris Redekop <ch...@replicon.com>.
I mean "one team writing DAGs for multiple clients, and those tasks can't
collide". We don't require actual security from malicious users, we just
need some safety rails to prevent accidents.

Re: How to sandbox the tasks from each other?

Posted by Jed Cunningham <je...@apache.org>.
Hey Chris,

I think the answer depends on what you mean by "multi-tenancy". I think you
mean one team writing DAGs for multiple clients and those tasks can't
collide. If so, the easiest way to have isolated workers is with
KubernetesExecutor. No shared tmp!
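
As an aside, with KubernetesExecutor you can even give each task an
explicitly bounded scratch volume via executor_config - a hedged
sketch; the size limit and mount path are illustrative:

    from kubernetes.client import models as k8s

    # The emptyDir is created fresh for each task pod and destroyed
    # with it, so no scratch data outlives (or leaks across) a task.
    scratch_config = {
        "pod_override": k8s.V1Pod(
            spec=k8s.V1PodSpec(
                containers=[
                    k8s.V1Container(
                        name="base",  # the main task container
                        volume_mounts=[
                            k8s.V1VolumeMount(name="scratch", mount_path="/tmp")
                        ],
                    )
                ],
                volumes=[
                    k8s.V1Volume(
                        name="scratch",
                        empty_dir=k8s.V1EmptyDirVolumeSource(size_limit="1Gi"),
                    )
                ],
            )
        )
    }
    # e.g. PythonOperator(..., executor_config=scratch_config)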

If instead you mean multiple teams sharing an instance (what I consider
multi-tenancy), it's a totally different situation, and in most cases
having separate instances is the right call if you require "security".

Remember, DAGs are arbitrary Python and you can do all sorts of interesting
things in them. Do you need isolation for accidental collisions, or do you
need to protect tenant-a from possibly-bad-actor-tenant-b?

More reading on Airflow multi-tenancy:
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-1%3A+Improve+Airflow+Security
https://lists.apache.org/list?dev@airflow.apache.org:lte=1y:multi-tenancy

Jed