You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flink.apache.org by Dian Fu <di...@gmail.com> on 2020/04/01 02:49:42 UTC

[DISCUSS] FLIP-120: Support conversion between PyFlink Table and Pandas DataFrame

Hi everyone,

I'd like to start a discussion about supporting conversion between PyFlink Table and Pandas DataFrame.

Pandas dataframe is the de-facto standard to work with tabular data in Python community. PyFlink table is Flink’s representation of the tabular data in Python language. It would be nice to provide the functionality to convert between the PyFlink table and Pandas dataframe in PyFlink Table API. It provides users the ability to switch between PyFlink and Pandas seamlessly when processing data in Python language without an extra intermediate connectors.

Jincheng Sun and I have discussed offline and have drafted the FLIP-120[1]. Looking forward to your feedback!

Regards,
Dian

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-120%3A+Support+conversion+between+PyFlink+Table+and+Pandas+DataFrame

Re: [DISCUSS] FLIP-120: Support conversion between PyFlink Table and Pandas DataFrame

Posted by Dian Fu <di...@gmail.com>.
Thanks you all for the discussion. It seems that we have reached consensus on the design. I will start a VOTE thread if there are no other feedbacks.

Regards,
Dian

> 在 2020年4月3日,下午2:58,Wei Zhong <we...@gmail.com> 写道:
> 
> Hi Dian,
> 
> Thanks for driving this. Big +1 for supporting from/to pandas in PyFlink!
> 
> Best,
> Wei
> 
>> 在 2020年4月3日,13:46,jincheng sun <su...@gmail.com> 写道:
>> 
>> +1, Thanks for bring up this discussion @Dian Fu <di...@gmail.com>
>> 
>> Best,
>> Jincheng
>> 
>> 
>> Jeff Zhang <zj...@gmail.com> 于2020年4月1日周三 下午1:27写道:
>> 
>>> Thanks for the reply, Dian, that make sense to me.
>>> 
>>> Dian Fu <di...@gmail.com> 于2020年4月1日周三 上午11:53写道:
>>> 
>>>> Hi Jeff,
>>>> 
>>>> Thanks for your feedback.
>>>> 
>>>> ArrowTableSink is a Flink sink which is responsible for collecting the
>>>> data of the table. It will serialize the data of the table to Arrow
>>> format
>>>> to make sure that it could be deserialized to pandas dataframe
>>> efficiently.
>>>> You are right that pandas dataframe is constructed at the client side and
>>>> so there needs a way to transfer the table data from the ArrowTableSink
>>> to
>>>> the client. It shares the same design as Table.collect on how to transfer
>>>> the data to the client. This is still under lively discussion in
>>>> FLINK-14807. I think we can discuss it there on this aspect and so it's
>>> not
>>>> touched in this design(already mentioned in the design doc). Then we can
>>>> focus on table/dataframe conversion in this design. Does that make sense
>>> to
>>>> you?
>>>> 
>>>> Thanks,
>>>> Dian
>>>> 
>>>> [1] https://issues.apache.org/jira/browse/FLINK-14807 <
>>>> https://issues.apache.org/jira/browse/FLINK-14807>
>>>>> 在 2020年4月1日,上午11:14,Jeff Zhang <zj...@gmail.com> 写道:
>>>>> 
>>>>> Thanks Dian for driving this, definitely +1
>>>>> 
>>>>> Here's my 2 cents:
>>>>> 
>>>>> 1. I would pay more attention on to_pandas than from_pandas.  Because
>>>>> to_pandas will be used more frequently I believe
>>>>> 2. I think ArrowTableSink may not be enough for to_pandas, because
>>> pandas
>>>>> dataframe is on client side, it is not a table sink. We still need to
>>>>> convert ArrowTableSink to pandas dataframe if I understand correctly.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Dian Fu <di...@gmail.com> 于2020年4月1日周三 上午10:49写道:
>>>>> 
>>>>>> Hi everyone,
>>>>>> 
>>>>>> I'd like to start a discussion about supporting conversion between
>>>> PyFlink
>>>>>> Table and Pandas DataFrame.
>>>>>> 
>>>>>> Pandas dataframe is the de-facto standard to work with tabular data in
>>>>>> Python community. PyFlink table is Flink’s representation of the
>>> tabular
>>>>>> data in Python language. It would be nice to provide the functionality
>>>> to
>>>>>> convert between the PyFlink table and Pandas dataframe in PyFlink
>>> Table
>>>>>> API. It provides users the ability to switch between PyFlink and
>>> Pandas
>>>>>> seamlessly when processing data in Python language without an extra
>>>>>> intermediate connectors.
>>>>>> 
>>>>>> Jincheng Sun and I have discussed offline and have drafted the
>>>>>> FLIP-120[1]. Looking forward to your feedback!
>>>>>> 
>>>>>> Regards,
>>>>>> Dian
>>>>>> 
>>>>>> [1]
>>>>>> 
>>>> 
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-120%3A+Support+conversion+between+PyFlink+Table+and+Pandas+DataFrame
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Best Regards
>>>>> 
>>>>> Jeff Zhang
>>>> 
>>>> 
>>> 
>>> --
>>> Best Regards
>>> 
>>> Jeff Zhang
>>> 
> 


Re: [DISCUSS] FLIP-120: Support conversion between PyFlink Table and Pandas DataFrame

Posted by Wei Zhong <we...@gmail.com>.
Hi Dian,

Thanks for driving this. Big +1 for supporting from/to pandas in PyFlink!

Best,
Wei

> 在 2020年4月3日,13:46,jincheng sun <su...@gmail.com> 写道:
> 
> +1, Thanks for bring up this discussion @Dian Fu <di...@gmail.com>
> 
> Best,
> Jincheng
> 
> 
> Jeff Zhang <zj...@gmail.com> 于2020年4月1日周三 下午1:27写道:
> 
>> Thanks for the reply, Dian, that make sense to me.
>> 
>> Dian Fu <di...@gmail.com> 于2020年4月1日周三 上午11:53写道:
>> 
>>> Hi Jeff,
>>> 
>>> Thanks for your feedback.
>>> 
>>> ArrowTableSink is a Flink sink which is responsible for collecting the
>>> data of the table. It will serialize the data of the table to Arrow
>> format
>>> to make sure that it could be deserialized to pandas dataframe
>> efficiently.
>>> You are right that pandas dataframe is constructed at the client side and
>>> so there needs a way to transfer the table data from the ArrowTableSink
>> to
>>> the client. It shares the same design as Table.collect on how to transfer
>>> the data to the client. This is still under lively discussion in
>>> FLINK-14807. I think we can discuss it there on this aspect and so it's
>> not
>>> touched in this design(already mentioned in the design doc). Then we can
>>> focus on table/dataframe conversion in this design. Does that make sense
>> to
>>> you?
>>> 
>>> Thanks,
>>> Dian
>>> 
>>> [1] https://issues.apache.org/jira/browse/FLINK-14807 <
>>> https://issues.apache.org/jira/browse/FLINK-14807>
>>>> 在 2020年4月1日,上午11:14,Jeff Zhang <zj...@gmail.com> 写道:
>>>> 
>>>> Thanks Dian for driving this, definitely +1
>>>> 
>>>> Here's my 2 cents:
>>>> 
>>>> 1. I would pay more attention on to_pandas than from_pandas.  Because
>>>> to_pandas will be used more frequently I believe
>>>> 2. I think ArrowTableSink may not be enough for to_pandas, because
>> pandas
>>>> dataframe is on client side, it is not a table sink. We still need to
>>>> convert ArrowTableSink to pandas dataframe if I understand correctly.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Dian Fu <di...@gmail.com> 于2020年4月1日周三 上午10:49写道:
>>>> 
>>>>> Hi everyone,
>>>>> 
>>>>> I'd like to start a discussion about supporting conversion between
>>> PyFlink
>>>>> Table and Pandas DataFrame.
>>>>> 
>>>>> Pandas dataframe is the de-facto standard to work with tabular data in
>>>>> Python community. PyFlink table is Flink’s representation of the
>> tabular
>>>>> data in Python language. It would be nice to provide the functionality
>>> to
>>>>> convert between the PyFlink table and Pandas dataframe in PyFlink
>> Table
>>>>> API. It provides users the ability to switch between PyFlink and
>> Pandas
>>>>> seamlessly when processing data in Python language without an extra
>>>>> intermediate connectors.
>>>>> 
>>>>> Jincheng Sun and I have discussed offline and have drafted the
>>>>> FLIP-120[1]. Looking forward to your feedback!
>>>>> 
>>>>> Regards,
>>>>> Dian
>>>>> 
>>>>> [1]
>>>>> 
>>> 
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-120%3A+Support+conversion+between+PyFlink+Table+and+Pandas+DataFrame
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Best Regards
>>>> 
>>>> Jeff Zhang
>>> 
>>> 
>> 
>> --
>> Best Regards
>> 
>> Jeff Zhang
>> 


Re: [DISCUSS] FLIP-120: Support conversion between PyFlink Table and Pandas DataFrame

Posted by jincheng sun <su...@gmail.com>.
+1, Thanks for bring up this discussion @Dian Fu <di...@gmail.com>

Best,
Jincheng


Jeff Zhang <zj...@gmail.com> 于2020年4月1日周三 下午1:27写道:

> Thanks for the reply, Dian, that make sense to me.
>
> Dian Fu <di...@gmail.com> 于2020年4月1日周三 上午11:53写道:
>
> > Hi Jeff,
> >
> > Thanks for your feedback.
> >
> > ArrowTableSink is a Flink sink which is responsible for collecting the
> > data of the table. It will serialize the data of the table to Arrow
> format
> > to make sure that it could be deserialized to pandas dataframe
> efficiently.
> > You are right that pandas dataframe is constructed at the client side and
> > so there needs a way to transfer the table data from the ArrowTableSink
> to
> > the client. It shares the same design as Table.collect on how to transfer
> > the data to the client. This is still under lively discussion in
> > FLINK-14807. I think we can discuss it there on this aspect and so it's
> not
> > touched in this design(already mentioned in the design doc). Then we can
> > focus on table/dataframe conversion in this design. Does that make sense
> to
> > you?
> >
> > Thanks,
> > Dian
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-14807 <
> > https://issues.apache.org/jira/browse/FLINK-14807>
> > > 在 2020年4月1日,上午11:14,Jeff Zhang <zj...@gmail.com> 写道:
> > >
> > > Thanks Dian for driving this, definitely +1
> > >
> > > Here's my 2 cents:
> > >
> > > 1. I would pay more attention on to_pandas than from_pandas.  Because
> > > to_pandas will be used more frequently I believe
> > > 2. I think ArrowTableSink may not be enough for to_pandas, because
> pandas
> > > dataframe is on client side, it is not a table sink. We still need to
> > > convert ArrowTableSink to pandas dataframe if I understand correctly.
> > >
> > >
> > >
> > >
> > > Dian Fu <di...@gmail.com> 于2020年4月1日周三 上午10:49写道:
> > >
> > >> Hi everyone,
> > >>
> > >> I'd like to start a discussion about supporting conversion between
> > PyFlink
> > >> Table and Pandas DataFrame.
> > >>
> > >> Pandas dataframe is the de-facto standard to work with tabular data in
> > >> Python community. PyFlink table is Flink’s representation of the
> tabular
> > >> data in Python language. It would be nice to provide the functionality
> > to
> > >> convert between the PyFlink table and Pandas dataframe in PyFlink
> Table
> > >> API. It provides users the ability to switch between PyFlink and
> Pandas
> > >> seamlessly when processing data in Python language without an extra
> > >> intermediate connectors.
> > >>
> > >> Jincheng Sun and I have discussed offline and have drafted the
> > >> FLIP-120[1]. Looking forward to your feedback!
> > >>
> > >> Regards,
> > >> Dian
> > >>
> > >> [1]
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-120%3A+Support+conversion+between+PyFlink+Table+and+Pandas+DataFrame
> > >
> > >
> > >
> > > --
> > > Best Regards
> > >
> > > Jeff Zhang
> >
> >
>
> --
> Best Regards
>
> Jeff Zhang
>

Re: [DISCUSS] FLIP-120: Support conversion between PyFlink Table and Pandas DataFrame

Posted by Jeff Zhang <zj...@gmail.com>.
Thanks for the reply, Dian, that make sense to me.

Dian Fu <di...@gmail.com> 于2020年4月1日周三 上午11:53写道:

> Hi Jeff,
>
> Thanks for your feedback.
>
> ArrowTableSink is a Flink sink which is responsible for collecting the
> data of the table. It will serialize the data of the table to Arrow format
> to make sure that it could be deserialized to pandas dataframe efficiently.
> You are right that pandas dataframe is constructed at the client side and
> so there needs a way to transfer the table data from the ArrowTableSink to
> the client. It shares the same design as Table.collect on how to transfer
> the data to the client. This is still under lively discussion in
> FLINK-14807. I think we can discuss it there on this aspect and so it's not
> touched in this design(already mentioned in the design doc). Then we can
> focus on table/dataframe conversion in this design. Does that make sense to
> you?
>
> Thanks,
> Dian
>
> [1] https://issues.apache.org/jira/browse/FLINK-14807 <
> https://issues.apache.org/jira/browse/FLINK-14807>
> > 在 2020年4月1日,上午11:14,Jeff Zhang <zj...@gmail.com> 写道:
> >
> > Thanks Dian for driving this, definitely +1
> >
> > Here's my 2 cents:
> >
> > 1. I would pay more attention on to_pandas than from_pandas.  Because
> > to_pandas will be used more frequently I believe
> > 2. I think ArrowTableSink may not be enough for to_pandas, because pandas
> > dataframe is on client side, it is not a table sink. We still need to
> > convert ArrowTableSink to pandas dataframe if I understand correctly.
> >
> >
> >
> >
> > Dian Fu <di...@gmail.com> 于2020年4月1日周三 上午10:49写道:
> >
> >> Hi everyone,
> >>
> >> I'd like to start a discussion about supporting conversion between
> PyFlink
> >> Table and Pandas DataFrame.
> >>
> >> Pandas dataframe is the de-facto standard to work with tabular data in
> >> Python community. PyFlink table is Flink’s representation of the tabular
> >> data in Python language. It would be nice to provide the functionality
> to
> >> convert between the PyFlink table and Pandas dataframe in PyFlink Table
> >> API. It provides users the ability to switch between PyFlink and Pandas
> >> seamlessly when processing data in Python language without an extra
> >> intermediate connectors.
> >>
> >> Jincheng Sun and I have discussed offline and have drafted the
> >> FLIP-120[1]. Looking forward to your feedback!
> >>
> >> Regards,
> >> Dian
> >>
> >> [1]
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-120%3A+Support+conversion+between+PyFlink+Table+and+Pandas+DataFrame
> >
> >
> >
> > --
> > Best Regards
> >
> > Jeff Zhang
>
>

-- 
Best Regards

Jeff Zhang

Re: [DISCUSS] FLIP-120: Support conversion between PyFlink Table and Pandas DataFrame

Posted by Dian Fu <di...@gmail.com>.
Hi Jeff,

Thanks for your feedback.

ArrowTableSink is a Flink sink which is responsible for collecting the data of the table. It will serialize the data of the table to Arrow format to make sure that it could be deserialized to pandas dataframe efficiently. You are right that pandas dataframe is constructed at the client side and so there needs a way to transfer the table data from the ArrowTableSink to the client. It shares the same design as Table.collect on how to transfer the data to the client. This is still under lively discussion in FLINK-14807. I think we can discuss it there on this aspect and so it's not touched in this design(already mentioned in the design doc). Then we can focus on table/dataframe conversion in this design. Does that make sense to you?

Thanks,
Dian

[1] https://issues.apache.org/jira/browse/FLINK-14807 <https://issues.apache.org/jira/browse/FLINK-14807>
> 在 2020年4月1日,上午11:14,Jeff Zhang <zj...@gmail.com> 写道:
> 
> Thanks Dian for driving this, definitely +1
> 
> Here's my 2 cents:
> 
> 1. I would pay more attention on to_pandas than from_pandas.  Because
> to_pandas will be used more frequently I believe
> 2. I think ArrowTableSink may not be enough for to_pandas, because pandas
> dataframe is on client side, it is not a table sink. We still need to
> convert ArrowTableSink to pandas dataframe if I understand correctly.
> 
> 
> 
> 
> Dian Fu <di...@gmail.com> 于2020年4月1日周三 上午10:49写道:
> 
>> Hi everyone,
>> 
>> I'd like to start a discussion about supporting conversion between PyFlink
>> Table and Pandas DataFrame.
>> 
>> Pandas dataframe is the de-facto standard to work with tabular data in
>> Python community. PyFlink table is Flink’s representation of the tabular
>> data in Python language. It would be nice to provide the functionality to
>> convert between the PyFlink table and Pandas dataframe in PyFlink Table
>> API. It provides users the ability to switch between PyFlink and Pandas
>> seamlessly when processing data in Python language without an extra
>> intermediate connectors.
>> 
>> Jincheng Sun and I have discussed offline and have drafted the
>> FLIP-120[1]. Looking forward to your feedback!
>> 
>> Regards,
>> Dian
>> 
>> [1]
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-120%3A+Support+conversion+between+PyFlink+Table+and+Pandas+DataFrame
> 
> 
> 
> -- 
> Best Regards
> 
> Jeff Zhang


Re: [DISCUSS] FLIP-120: Support conversion between PyFlink Table and Pandas DataFrame

Posted by Jeff Zhang <zj...@gmail.com>.
Thanks Dian for driving this, definitely +1

Here's my 2 cents:

1. I would pay more attention on to_pandas than from_pandas.  Because
to_pandas will be used more frequently I believe
2. I think ArrowTableSink may not be enough for to_pandas, because pandas
dataframe is on client side, it is not a table sink. We still need to
convert ArrowTableSink to pandas dataframe if I understand correctly.




Dian Fu <di...@gmail.com> 于2020年4月1日周三 上午10:49写道:

> Hi everyone,
>
> I'd like to start a discussion about supporting conversion between PyFlink
> Table and Pandas DataFrame.
>
> Pandas dataframe is the de-facto standard to work with tabular data in
> Python community. PyFlink table is Flink’s representation of the tabular
> data in Python language. It would be nice to provide the functionality to
> convert between the PyFlink table and Pandas dataframe in PyFlink Table
> API. It provides users the ability to switch between PyFlink and Pandas
> seamlessly when processing data in Python language without an extra
> intermediate connectors.
>
> Jincheng Sun and I have discussed offline and have drafted the
> FLIP-120[1]. Looking forward to your feedback!
>
> Regards,
> Dian
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-120%3A+Support+conversion+between+PyFlink+Table+and+Pandas+DataFrame



-- 
Best Regards

Jeff Zhang