Posted to dev@airflow.apache.org by Jarek Potiuk <ja...@potiuk.com> on 2023/12/04 18:14:36 UTC

[DISCUSS] Allowlist in serialization [was: Serialization in Apache Airflow]

Let me start separate threads from the main discussion. It will be easier
to follow.

I'd really like to understand the actors involved and what kind of threats
the allowlist is supposed to protect us from.

I think it would be great if we spelled out the "attack" scenario we think
the allowlist will protect us against - explaining the actors involved and
what actions they could take that the allowlist prevents.

It might be that I do not see it, but as experience from our security team
has shown, having a clear step-by-step explanation of what happens allows
us to reason better about it and see if the protection is actually
"functional". I think for now it's not doing much (except making it a bit
harder for someone who would like to exploit it), but I might be mistaken
(that has happened many times during our security discussions).


Pickle and the likes can execute arbitrary code that is inside the
> serialized object.
>

Yep. This is super dangerous indeed.
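
To make it concrete for anyone following along, here is a minimal,
self-contained sketch (illustrative only, not Airflow code) of why
unpickling untrusted data is dangerous: any class can define __reduce__ to
make the unpickler call an arbitrary callable with arbitrary arguments.

```python
import pickle

# Illustrative "PickleBomb": __reduce__ tells the unpickler to call an
# arbitrary callable. A real attacker would call os.system / subprocess
# instead of print.
class PickleBomb:
    def __reduce__(self):
        return (print, ("this ran during unpickling",))

payload = pickle.dumps(PickleBomb())

# The victim only has to deserialize the bytes for the callable to run.
pickle.loads(payload)
```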


> Airflow's serializers do not execute arbitrary code. Also your assumption
> of the 'serious' security problem is incorrect.


I think that's a misunderstanding. I simply think that whatever protection
we are adding here is not effective, because it might be easily bypassed by
DAG authors (if it is the DAG authors' actions we are trying to protect
against). I am not sure if it is serious or not, because I am not sure what
we are trying to protect against.

But I believe whatever we do might simply not work as intended and even
more - I think it's quite unnecessary.

So I want to know the intentions, actors and scenarios :).

I think clearly spelling out actors involved and scenario of the attack we
are trying to protect against should help with clarifying it. I think it's
been hinted at in the next few paragraphs, but I wanted to make sure that I
understand it.


> The deserializer will need
> to be able to load the class by using 'import_string' after verifying that
> it matches the allowed pattern (by default 'airflow.*'). If a DAG author
> does something like `airflow.providers.bla.bla` that class cannot be loaded
> neither can the DAG author make that class available to a different DAG in
> a different context (which is where a real attack vector would be).
>

So do I understand correctly that this one is aimed at protecting against
one DAG author (writing DAG A and storing data in XCom) - i.e. not allowing
DAG B to retrieve that XCom and execute code that DAG A's author wanted
DAG B to execute during deserialization?

In other words - we want to make sure that DAGs are somewhat isolated from
each other?

For now we are ignoring the fact that tasks from different DAGs are
potentially executed on the same worker, in a fork of the same process,
which potentially gives one task other ways to influence what other tasks
are doing. This is a different scenario / layer, and I do understand
the concept of layered protection, but I wanted to make sure that this has
been a deliberate consideration.

Is this a fair assessment?

Are there any other reasons for that allowlist?

Finally - did we consider the potential attack scenario where importing
sub-packages might come from multiple locations on PYTHONPATH? (I am not
sure it is possible to exploit in our case, but per
https://peps.python.org/pep-0420/ you can import sub-packages of a package
from several locations.)
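
To illustrate what I mean, here is a self-contained sketch with a
hypothetical package name (not Airflow's actual layout): with PEP 420, two
unrelated directories on sys.path can both contribute sub-packages to the
same top-level namespace package, so a dotted-name pattern alone does not
tell you which directory the imported code came from.

```python
import pathlib
import sys
import tempfile

# Two separate roots, each contributing to the SAME namespace package
# "mypkg" (no __init__.py anywhere - that is what makes it PEP 420).
root_a = pathlib.Path(tempfile.mkdtemp())
root_b = pathlib.Path(tempfile.mkdtemp())
(root_a / "mypkg").mkdir()
(root_b / "mypkg").mkdir()
(root_a / "mypkg" / "safe.py").write_text("ORIGIN = 'root_a'\n")
(root_b / "mypkg" / "injected.py").write_text("ORIGIN = 'root_b'\n")

sys.path[:0] = [str(root_a), str(root_b)]

import mypkg.injected
import mypkg.safe

print(mypkg.safe.ORIGIN, mypkg.injected.ORIGIN)  # root_a root_b
print(list(mypkg.__path__))  # the namespace package spans both roots
```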

An attack vector is considered (as mentioned above) if you can change
> something in someone else's context. So, if I would be able to serialize
> something and then influence / adjust its deserialization in a different
> context which should not be under my control.
>

Yeah. I think here the "my control" and "others" and "different context"
(i.e. actors involved) were not clear to me (still not entirely).

 We have a security model (just updated it to clarify some of the DAG
author expectations) where currently we do not distinguish between
different DAG authors - we assume that all DAG Authors have access to the
same execution environment and generally speaking we cannot assume that
other DAG authors are isolated from each other. It's just impossible
(currently) for multiple reasons.

We are working on changing it with AIP-44 and one more not-yet-written AIP,
and maybe it's good that this serialization allowlist adds an extra layer of
security, but I think it has the potential of giving our users a false sense
of security if we do not properly describe it in the model. So we need to be
very clear when we are describing the intended level of isolation provided.


J.

Re: [DISCUSS] Allowlist in serialization [was: Serialization in Apache Airflow]

Posted by Bolke de Bruin <bd...@gmail.com>.
On Mon, 4 Dec 2023 at 19:15, Jarek Potiuk <ja...@potiuk.com> wrote:

>
> > Airflow's serializers do not execute arbitrary code. Also your assumption
> > of the 'serious' security problem is incorrect.
>
>
> I think that's a misunderstanding. I simply think that whatever protection
> we are adding here is not effective, because it might be easily bypassed by
> DAG authors (if it is the DAG authors' actions we are trying to protect
> against). I am not sure if it is serious or not, because I am not sure what
> we are trying to protect against.
>
> But I believe whatever we do might simply not work as intended and even
> more - I think it's quite unnecessary.
>
> So I want to know the intentions, actors and scenarios :).
>
> I think clearly spelling out actors involved and scenario of the attack we
> are trying to protect against should help with clarifying it. I think it's
> been hinted at in the next few paragraphs, but I wanted to make sure that I
> understand it.
>
>

> > The deserializer will need
> > to be able to load the class by using 'import_string' after verifying
> that
> > it matches the allowed pattern (by default 'airflow.*'). If a DAG author
> > does something like `airflow.providers.bla.bla` that class cannot be
> loaded
> > neither can the DAG author make that class available to a different DAG
> in
> > a different context (which is where a real attack vector would be).
> >
>
> So do I understand correctly that this one is aimed at protecting against
> one DAG author (writing DAG A and storing data in XCom) - i.e. not allowing
> DAG B to retrieve that XCom and execute code that DAG A's author wanted
> DAG B to execute during deserialization?
>
> In other words - we want to make sure that DAGs are somewhat isolated from
> each other?
>
> For now we are ignoring the fact that tasks from different DAGs are
> potentially executed on the same worker, in a fork of the same process,
> which potentially gives one task other ways to influence what other tasks
> are doing.


Yes, this really is a different layer, where the OS provides certain
guarantees and where Airflow has other protections (or maybe not).


> This is a different scenario / layer, and I do understand
> the concept of layered protection, but I wanted to make sure that this has
> been a deliberate consideration.
>
> Is this a fair assessment?
>
>
Yes. The intention here is to protect the confidentiality of the DAG
(worker, content) of User 1 (U1) against Malicious Actor 1 (MA1). These are
different users of the system; both can author DAGs, but they do not have
direct access to each other's DAGs.

You can craft a dict that gets deserialized into an arbitrary object. This
dict can be pushed into the DB (manually) as an XCom value. This object
could run a deserialize method that does the same as the 'PickleBomb'.
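
To sketch that vector (hypothetical field names and class path, not the
actual wire format): the attacker only controls a JSON-able dict stored as
an XCom value, and a naive deserializer that imports whatever dotted path
the dict names would hand control to the attacker's class in the reader's
process.

```python
from importlib import import_module

# Hypothetical serialized form an attacker could write into the XCom table.
payload = {
    "__classname__": "evilpkg.EvilThing",  # attacker-chosen dotted path
    "__data__": {"looks": "harmless"},
}

def naive_deserialize(doc: dict):
    """A deserializer with NO allowlist - shown only to illustrate the risk."""
    module_name, _, class_name = doc["__classname__"].rpartition(".")
    cls = getattr(import_module(module_name), class_name)
    obj = cls()
    # If the named class defines a deserialization hook, it now runs in the
    # *reading* task's process - the "PickleBomb" effect without pickle.
    obj.deserialize(doc["__data__"])
    return obj
```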

The allowlist protects against this vector, as we have a "vetted" list of
objects that can be deserialized into. So while the integrity of the DAG can
be affected (a different value), you shouldn't be able to affect its
confidentiality.
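
Roughly, the gate looks like this (a minimal sketch of the idea, not the
actual implementation): the dotted path is matched against the vetted
patterns before anything gets imported or instantiated.

```python
import fnmatch
from importlib import import_module

ALLOWED_PATTERNS = ["airflow.*"]  # the default pattern mentioned above

def allowlisted_import(classname: str):
    # Reject anything outside the vetted namespaces *before* importing it,
    # so no attacker-chosen code ever gets a chance to run.
    if not any(fnmatch.fnmatch(classname, pat) for pat in ALLOWED_PATTERNS):
        raise ImportError(f"{classname} is not on the deserialization allowlist")
    module_name, _, class_name = classname.rpartition(".")
    return getattr(import_module(module_name), class_name)

# "evilpkg.EvilThing" is rejected, while names under "airflow.*" pass the check.
```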


> Are there any other reasons for that allowlist?
>
>
I think the above is sufficient.


> Finally - did we consider the potential attack scenario where importing
> sub-packages might come from multiple locations on PYTHONPATH? (I am not
> sure it is possible to exploit in our case, but per
> https://peps.python.org/pep-0420/ you can import sub-packages of a package
> from several locations.)
>

The assumption here is that the OPS environment is immutable / in a known
state. Thus the DAG of U1 runs in a separate process from the DAG of MA1.
That means (= it is assumed) that MA1 cannot change the PYTHONPATH of U1.


> An attack vector is considered (as mentioned above) if you can change
> > something in someone else's context. So, if I would be able to serialize
> > something and then influence / adjust its deserialization in a different
> > context which should not be under my control.
> >
>
> Yeah. I think here the "my control" and "others" and "different context"
> (i.e. actors involved) were not clear to me (still not entirely).
>
>  We have a security model (just updated it to clarify some of the DAG
> author expectations) where currently we do not distinguish between
> different DAG authors - we assume that all DAG Authors have access to the
> same execution environment and generally speaking we cannot assume that
> other DAG authors are isolated from each other. It's just impossible
> (currently) for multiple reasons.
>
> We are working on changing it with AIP-44 and one more not-yet-written AIP,
> and maybe it's good that this serialization allowlist adds an extra layer of
> security, but I think it has the potential of giving our users a false sense
> of security if we do not properly describe it in the model. So we need to be
> very clear when we are describing the intended level of isolation provided.
>
>
We are not just talking about the different DAG authors here. The DAG of U1
might run in an environment different from the DAG of MA1. If MA1 wants to
get extra information on that environment by doing something in the context
of U1, that affects OPS, which, in good practice, is not (managed by) the
same person as U1.

Bolke
-- 

--
Bolke de Bruin
bdbruin@gmail.com

Re: [DISCUSS] Allowlist in serialization [was: Serialization in Apache Airflow]

Posted by Bolke de Bruin <bd...@gmail.com>.
On Mon, 4 Dec 2023 at 20:16, Andrey Anshin <an...@taragol.is> wrote:

> > Pickle and the likes can execute arbitrary code that is inside the
> > serialized object.
> >
>
> > Yep. This is super dangerous indeed.
>
> My fifty cents. This sounds scarier than it actually is; it is mostly
> covered by these simple things:
>

I think you underestimate the challenges malicious actors pose. Most
attacks come from the inside. For example, you do not want a DEVeloper of a
DAG to have access to OPS of the workers. If you can craft a pickle that
does something else in one location than in another, that is a serious
issue.

Basically you can rely on pickle when you are sure you control both sides
and the user cannot interfere. This is what Spark does for example with
Spark Connect. We cannot rely on that because the user is in between.

Bolke

-- 

--
Bolke de Bruin
bdbruin@gmail.com

Re: [DISCUSS] Allowlist in serialization [was: Serialization in Apache Airflow]

Posted by Andrey Anshin <an...@taragol.is>.
> Pickle and the likes can execute arbitrary code that is inside the
> serialized object.
>

> Yep. This is super dangerous indeed.

My fifty cents. This sounds scarier than it actually is; it is mostly
covered by these simple things:
- Do not open attachments from unknown senders / untrusted sources
- Do not open links from unknown senders / untrusted sources
- Do not fill in personal information / credit card details on crappy sites
- Do not install random software from the Internet
- Do not run strange perl scripts in the console
- Do not run software as superuser
- Use repeatable installations, which prevent the situation where the owner
of some package loses control of their PyPI account and someone uploads a
new, modified version of the software
- Do not try to deserialize pickled objects from unknown sources

The main problem with dill / cloudpickle is that they have limitations
across Python versions / library versions. The same applies to pickle, but
I think there it depends on the protocol version.


