Posted to dev@airflow.apache.org by David Muñoz <da...@gmail.com> on 2019/12/24 07:02:37 UTC

[AirFlow]: Pandas DataFrame Between Tasks

Hi,

Excuse me, I am new to this and maybe this topic has already been discussed.

I would like to know if there is a way to "share/pass" pandas DataFrames
between tasks in Airflow.

Any help would be appreciated.

Thank you!!!

David.

Re: [AirFlow]: Pandas DataFrame Between Tasks

Posted by Beau Barker <be...@gmail.com>.
As Deng mentioned, consider combining the operators.

The Airflow documentation used to say: "If you need to use data between tasks, consider combining them into a single operator. But if you must have separate tasks, there is XCom."
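For illustration, a rough sketch of what combining them looks like (the paths and columns here are made up), instead of two tasks plus XCom:

import pandas as pd

def extract_and_transform(**kwargs):
    # One PythonOperator runs both steps, so the DataFrame stays in
    # memory and never has to cross a task boundary.
    df = pd.read_csv("/tmp/input.csv")           # placeholder source
    df["total"] = df["price"] * df["qty"]        # placeholder transform
    df.to_csv("/tmp/output.csv", index=False)    # placeholder sink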

Sent from my iPhone



Re: [AirFlow]: Pandas DataFrame Between Tasks

Posted by Anton Zayniev <an...@gmail.com>.
Maybe the simplest solution would be to generate a temp CSV file from
pandas and pass its path through XCom to the next task. To make it idempotent,
you can dynamically generate the filename to avoid collisions.
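For illustration, a rough sketch of that idea (paths and task ids are made up; the execution date in the filename is what keeps reruns of the same interval from colliding):

import pandas as pd

def write_df(**kwargs):
    df = pd.DataFrame({"a": [1, 2, 3]})                       # placeholder data
    path = "/tmp/orders_{}.csv".format(kwargs["ds_nodash"])   # e.g. /tmp/orders_20191224.csv
    df.to_csv(path, index=False)
    return path                                               # returned value is pushed to XCom

def read_df(**kwargs):
    path = kwargs["ti"].xcom_pull(task_ids="write_df")
    return pd.read_csv(path)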


Re: [AirFlow]: Pandas DataFrame Between Tasks

Posted by Jarek Potiuk <Ja...@polidea.com>.
I think it really depends on what kind of data, what size, how frequently you
are going to use it, and what the usage pattern will be. It's best to make
a conscious choice based on knowing the options you have :).

There are a number of options on top of the ones mentioned above. From what I
hear, Avro is becoming more and more popular; most of the services (like BQ
and others) support it. Parquet is also an interesting one, and it is natively
supported by pandas.

There are some converters that can be used to convert between the different
formats (for example https://github.com/ynqa/pandavro for pandas<>Avro, or the
"to_parquet" method built into pandas itself:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_parquet.html
).
Avro is record based (like CSV) with nested data capability, whereas Parquet
is column based (and the set of columns can change over time).
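A quick sketch of the built-in Parquet round trip (the path is a placeholder, and it assumes pyarrow or fastparquet is installed):

import pandas as pd

df = pd.DataFrame({"user": ["a", "b"], "clicks": [10, 3]})
df.to_parquet("/tmp/df.parquet")            # columnar, schema travels with the file
df2 = pd.read_parquet("/tmp/df.parquet")    # load in the downstream task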

But those are just a few examples and it's up to you to choose the right
approach for your case, so here are some articles to explore:

   - A nice comparison/benchmark of different formats for pandas
   serialisation:
   https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d
   - A nice explanation on SO of the benefits of using Parquet:
   https://stackoverflow.com/questions/36822224/what-are-the-pros-and-cons-of-parquet-format-compared-to-other-formats
   - And finally a very nice article describing different types of file
   formats (record, column, nested, hierarchical, array, model...), including
   comparisons and properties of each type:
   https://stackoverflow.com/questions/36822224/what-are-the-pros-and-cons-of-parquet-format-compared-to-other-formats


J.






-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>

Re: [AirFlow]: Pandas DataFrame Between Tasks

Posted by Deng Xiaodong <xd...@gmail.com>.
Yep, exactly what I suggested below.

In terms of format, Feather (suggested by Robin below) should be favoured
over .csv given it persists the schema as well.
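A tiny illustration of the schema point (hypothetical paths): a CSV round trip can lose dtypes, while Feather keeps them.

import pandas as pd

df = pd.DataFrame({"ts": pd.to_datetime(["2019-12-24"]), "flag": [True]})

df.to_csv("/tmp/df.csv", index=False)
print(pd.read_csv("/tmp/df.csv").dtypes)          # ts comes back as plain object

df.to_feather("/tmp/df.feather")
print(pd.read_feather("/tmp/df.feather").dtypes)  # ts stays datetime64[ns], flag stays bool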


XD


Re: [AirFlow]: Pandas DataFrame Between Tasks

Posted by Tomasz Urbaszek <to...@polidea.com>.
Personally, I would use the .csv format and store the file in an S3/GCS bucket.
XCom is meant to store small amounts of data.
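For example, pandas can read and write object-store paths directly when s3fs (or gcsfs for GCS) is installed; the bucket and key here are hypothetical:

import pandas as pd

path = "s3://my-bucket/dags/my_dag/2019-12-24/df.csv"   # or "gs://..." with gcsfs

df = pd.DataFrame({"a": [1, 2, 3]})    # placeholder data
df.to_csv(path, index=False)           # in the producing task; pass `path` via XCom
df2 = pd.read_csv(path)                # in the consuming task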

T.



-- 

Tomasz Urbaszek
Polidea <https://www.polidea.com/> | Software Engineer

M: +48 505 628 493 <+48505628493>
E: tomasz.urbaszek@polidea.com <to...@polidea.com>

Unique Tech
Check out our projects! <https://www.polidea.com/our-work>

Re: [AirFlow]: Pandas DataFrame Between Tasks

Posted by Robin Edwards <ro...@bidnamic.com>.
Feather is probably a good option for data frames:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_feather.html
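A minimal round trip, assuming pyarrow (or the feather-format package) is installed and with a placeholder path:

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
df.to_feather("/tmp/df.feather")            # write in the upstream task
df2 = pd.read_feather("/tmp/df.feather")    # read in the downstream task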

R


Re: [AirFlow]: Pandas DataFrame Between Tasks

Posted by Deng Xiaodong <xd...@gmail.com>.
Hi David.

The only “out of the box” way to share data/information between tasks is XCom (
https://airflow.apache.org/docs/stable/concepts.html?highlight=xcom#xcoms).

For your case, the quick suggestions I can share are:

- either merging your tasks
- or persisting your pandas DataFrame somewhere and then loading it in your 2nd
task (e.g. using pickle); see the sketch below
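A minimal sketch of the second option, assuming Airflow 1.10-style PythonOperators; the paths, DAG id and task ids are placeholders:

import pandas as pd
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago

def build_df(**kwargs):
    df = pd.DataFrame({"a": [1, 2, 3]})   # placeholder data
    path = "/tmp/shared_df.pkl"           # placeholder location
    df.to_pickle(path)
    return path                           # the returned path is pushed to XCom

def use_df(**kwargs):
    path = kwargs["ti"].xcom_pull(task_ids="build_df")
    df = pd.read_pickle(path)
    print(df.shape)

with DAG("share_df_example", start_date=days_ago(1), schedule_interval=None) as dag:
    build = PythonOperator(task_id="build_df", python_callable=build_df,
                           provide_context=True)
    use = PythonOperator(task_id="use_df", python_callable=use_df,
                         provide_context=True)
    build >> use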


XD
