You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@airflow.apache.org by Kaxil Naik <ka...@apache.org> on 2020/09/18 12:01:16 UTC

[DISCUSS] Removing Pickling from Airflow 2.0

Hi all,

We briefly discussed how pickling is currently used in Airflow codebase and
whether or not we should remove it for 2.0 in the Airflow 2.0 Dev call this
Monday.

Currently, AFAIK only *CeleryExecutor* supports pickling (code
<https://github.com/apache/airflow/blob/master/airflow/executors/executor_loader.py#L122-L126>).
We also have a flag on *airflow scheduler
<https://airflow.readthedocs.io/en/latest/cli-ref.html#scheduler> *CLI
command (*--do-pickle*) and "*--ship-dag*" on *airflow tasks run
<https://airflow.readthedocs.io/en/latest/cli-ref.html#run>* command.

If we want to remove pickling, I think Airflow 2.0 is the right time.

We have also deprecated the use of pickling in XComs.

https://docs.python.org/3/library/pickle.html -- lists some items on the
security implications of pickle and comparisons with JSON.

Another alternative is using *cloudpickle
<https://github.com/cloudpipe/cloudpickle> *(used by PySpark) instead
of *pickle,
*it suffers from the same security issues like *pickle *but does have some
more features compared to pickle.

What do you all think?

Regards,
Kaxil

Re: [DISCUSS] Removing Pickling from Airflow 2.0

Posted by Kaxil Naik <ka...@gmail.com>.
Thank You for the thoughts. I will create a VOTE thread at some point today
/ early tomorrow.

Regards,
Kaxil

On Mon, Sep 21, 2020 at 6:11 PM Jarek Potiuk <Ja...@polidea.com>
wrote:

> Happy to kill pickling for 2.0. While starting to review the HA, I see
> that more and more we rely on Serialization and there are some rather
> weird-looking left-overs in Ash's change that are only there because of
> pickling.
>
> I think we already know that Serialization becomes a first-class citizen
> in Airflow 2.0. And while we know the first versions of serialization had
> some teething problems - most of which have been already addressed (the
> most interesting one was few orders of magnitude increase in outbound
> traffic from the Airflow to the DB - but it's already fixed I believe).
> If we think that what pickling was used for can be handled entirely by
> serialization, I am all for killing pickling and rather than that focus
> 100% on serialization improvements, testing, and making it rock solid.
>
> J.
>
> On Fri, Sep 18, 2020 at 11:59 PM Daniel Imberman <
> daniel.imberman@gmail.com> wrote:
>
>> Are there any use-cases that REQUIRE pickle? Do we have any sense of what
>> % of the Airflow community depends on Pickle? I’m all for killing it if
>> possible but I want to make sure we’re not setting up a major hurdle for
>> migration.
>>
>> via Newton Mail [
>> https://cloudmagic.com/k/d/mailapp?ct=dx&cv=10.0.50&pv=10.15.6&source=email_footer_2
>> ]
>> On Fri, Sep 18, 2020 at 2:50 PM, Maxime Beauchemin <
>> maximebeauchemin@gmail.com> wrote:
>> I'm getting bad flashbacks of fighting with pickles early on in the
>> history
>> of the project. I've learned since then to stay away. Almost all solutions
>> that involve pickles are bad solutions. Beyond but related to the security
>> implication are the issues of pickle entanglement, not really knowing
>> what's in the pickle and how big it might get, and how it may affect the
>> environment it's deserialized into.
>>
>> 2.0 is a great time to kill pickles with fire.
>>
>> On Fri, Sep 18, 2020 at 5:01 AM Kaxil Naik <ka...@apache.org> wrote:
>>
>> > Hi all,
>> >
>> > We briefly discussed how pickling is currently used in Airflow codebase
>> and
>> > whether or not we should remove it for 2.0 in the Airflow 2.0 Dev call
>> this
>> > Monday.
>> >
>> > Currently, AFAIK only *CeleryExecutor* supports pickling (code
>> > <
>> >
>> https://github.com/apache/airflow/blob/master/airflow/executors/executor_loader.py#L122-L126
>> > >).
>> > We also have a flag on *airflow scheduler
>> > <https://airflow.readthedocs.io/en/latest/cli-ref.html#scheduler> *CLI
>> > command (*--do-pickle*) and "*--ship-dag*" on *airflow tasks run
>> > <https://airflow.readthedocs.io/en/latest/cli-ref.html#run>* command.
>> >
>> > If we want to remove pickling, I think Airflow 2.0 is the right time.
>> >
>> > We have also deprecated the use of pickling in XComs.
>> >
>> > https://docs.python.org/3/library/pickle.html -- lists some items on
>> the
>> > security implications of pickle and comparisons with JSON.
>> >
>> > Another alternative is using *cloudpickle
>> > <https://github.com/cloudpipe/cloudpickle> *(used by PySpark) instead
>> > of *pickle,
>> > *it suffers from the same security issues like *pickle *but does have
>> some
>> > more features compared to pickle.
>> >
>> > What do you all think?
>> >
>> > Regards,
>> > Kaxil
>> >
>
>
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129 <+48660796129>
> [image: Polidea] <https://www.polidea.com/>
>
>

Re: [DISCUSS] Removing Pickling from Airflow 2.0

Posted by Jarek Potiuk <Ja...@polidea.com>.
Happy to kill pickling for 2.0. While starting to review the HA, I see that
more and more we rely on Serialization and there are some rather
weird-looking left-overs in Ash's change that are only there because of
pickling.

I think we already know that Serialization becomes a first-class citizen in
Airflow 2.0. And while we know the first versions of serialization had some
teething problems - most of which have been already addressed (the most
interesting one was few orders of magnitude increase in outbound traffic
from the Airflow to the DB - but it's already fixed I believe).
If we think that what pickling was used for can be handled entirely by
serialization, I am all for killing pickling and rather than that focus
100% on serialization improvements, testing, and making it rock solid.

J.

On Fri, Sep 18, 2020 at 11:59 PM Daniel Imberman <da...@gmail.com>
wrote:

> Are there any use-cases that REQUIRE pickle? Do we have any sense of what
> % of the Airflow community depends on Pickle? I’m all for killing it if
> possible but I want to make sure we’re not setting up a major hurdle for
> migration.
>
> via Newton Mail [
> https://cloudmagic.com/k/d/mailapp?ct=dx&cv=10.0.50&pv=10.15.6&source=email_footer_2
> ]
> On Fri, Sep 18, 2020 at 2:50 PM, Maxime Beauchemin <
> maximebeauchemin@gmail.com> wrote:
> I'm getting bad flashbacks of fighting with pickles early on in the history
> of the project. I've learned since then to stay away. Almost all solutions
> that involve pickles are bad solutions. Beyond but related to the security
> implication are the issues of pickle entanglement, not really knowing
> what's in the pickle and how big it might get, and how it may affect the
> environment it's deserialized into.
>
> 2.0 is a great time to kill pickles with fire.
>
> On Fri, Sep 18, 2020 at 5:01 AM Kaxil Naik <ka...@apache.org> wrote:
>
> > Hi all,
> >
> > We briefly discussed how pickling is currently used in Airflow codebase
> and
> > whether or not we should remove it for 2.0 in the Airflow 2.0 Dev call
> this
> > Monday.
> >
> > Currently, AFAIK only *CeleryExecutor* supports pickling (code
> > <
> >
> https://github.com/apache/airflow/blob/master/airflow/executors/executor_loader.py#L122-L126
> > >).
> > We also have a flag on *airflow scheduler
> > <https://airflow.readthedocs.io/en/latest/cli-ref.html#scheduler> *CLI
> > command (*--do-pickle*) and "*--ship-dag*" on *airflow tasks run
> > <https://airflow.readthedocs.io/en/latest/cli-ref.html#run>* command.
> >
> > If we want to remove pickling, I think Airflow 2.0 is the right time.
> >
> > We have also deprecated the use of pickling in XComs.
> >
> > https://docs.python.org/3/library/pickle.html -- lists some items on the
> > security implications of pickle and comparisons with JSON.
> >
> > Another alternative is using *cloudpickle
> > <https://github.com/cloudpipe/cloudpickle> *(used by PySpark) instead
> > of *pickle,
> > *it suffers from the same security issues like *pickle *but does have
> some
> > more features compared to pickle.
> >
> > What do you all think?
> >
> > Regards,
> > Kaxil
> >



-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>

Re: [DISCUSS] Removing Pickling from Airflow 2.0

Posted by Daniel Imberman <da...@gmail.com>.
Are there any use-cases that REQUIRE pickle? Do we have any sense of what % of the Airflow community depends on Pickle? I’m all for killing it if possible but I want to make sure we’re not setting up a major hurdle for migration.

via Newton Mail [https://cloudmagic.com/k/d/mailapp?ct=dx&cv=10.0.50&pv=10.15.6&source=email_footer_2]
On Fri, Sep 18, 2020 at 2:50 PM, Maxime Beauchemin <ma...@gmail.com> wrote:
I'm getting bad flashbacks of fighting with pickles early on in the history
of the project. I've learned since then to stay away. Almost all solutions
that involve pickles are bad solutions. Beyond but related to the security
implication are the issues of pickle entanglement, not really knowing
what's in the pickle and how big it might get, and how it may affect the
environment it's deserialized into.

2.0 is a great time to kill pickles with fire.

On Fri, Sep 18, 2020 at 5:01 AM Kaxil Naik <ka...@apache.org> wrote:

> Hi all,
>
> We briefly discussed how pickling is currently used in Airflow codebase and
> whether or not we should remove it for 2.0 in the Airflow 2.0 Dev call this
> Monday.
>
> Currently, AFAIK only *CeleryExecutor* supports pickling (code
> <
> https://github.com/apache/airflow/blob/master/airflow/executors/executor_loader.py#L122-L126
> >).
> We also have a flag on *airflow scheduler
> <https://airflow.readthedocs.io/en/latest/cli-ref.html#scheduler> *CLI
> command (*--do-pickle*) and "*--ship-dag*" on *airflow tasks run
> <https://airflow.readthedocs.io/en/latest/cli-ref.html#run>* command.
>
> If we want to remove pickling, I think Airflow 2.0 is the right time.
>
> We have also deprecated the use of pickling in XComs.
>
> https://docs.python.org/3/library/pickle.html -- lists some items on the
> security implications of pickle and comparisons with JSON.
>
> Another alternative is using *cloudpickle
> <https://github.com/cloudpipe/cloudpickle> *(used by PySpark) instead
> of *pickle,
> *it suffers from the same security issues like *pickle *but does have some
> more features compared to pickle.
>
> What do you all think?
>
> Regards,
> Kaxil
>

Re: [DISCUSS] Removing Pickling from Airflow 2.0

Posted by Maxime Beauchemin <ma...@gmail.com>.
I'm getting bad flashbacks of fighting with pickles early on in the history
of the project. I've learned since then to stay away. Almost all solutions
that involve pickles are bad solutions. Beyond but related to the security
implication are the issues of pickle entanglement, not really knowing
what's in the pickle and how big it might get, and how it may affect the
environment it's deserialized into.

2.0 is a great time to kill pickles with fire.

On Fri, Sep 18, 2020 at 5:01 AM Kaxil Naik <ka...@apache.org> wrote:

> Hi all,
>
> We briefly discussed how pickling is currently used in Airflow codebase and
> whether or not we should remove it for 2.0 in the Airflow 2.0 Dev call this
> Monday.
>
> Currently, AFAIK only *CeleryExecutor* supports pickling (code
> <
> https://github.com/apache/airflow/blob/master/airflow/executors/executor_loader.py#L122-L126
> >).
> We also have a flag on *airflow scheduler
> <https://airflow.readthedocs.io/en/latest/cli-ref.html#scheduler> *CLI
> command (*--do-pickle*) and "*--ship-dag*" on *airflow tasks run
> <https://airflow.readthedocs.io/en/latest/cli-ref.html#run>* command.
>
> If we want to remove pickling, I think Airflow 2.0 is the right time.
>
> We have also deprecated the use of pickling in XComs.
>
> https://docs.python.org/3/library/pickle.html -- lists some items on the
> security implications of pickle and comparisons with JSON.
>
> Another alternative is using *cloudpickle
> <https://github.com/cloudpipe/cloudpickle> *(used by PySpark) instead
> of *pickle,
> *it suffers from the same security issues like *pickle *but does have some
> more features compared to pickle.
>
> What do you all think?
>
> Regards,
> Kaxil
>