You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flink.apache.org by jincheng sun <su...@gmail.com> on 2020/09/03 05:47:10 UTC

Re: [DISCUSS] FLIP-137: Support Pandas UDAF in PyFlink

Thanks for the update Xingbo!

Pandas UDAF can reuse the `class aggregate function (user defined
function)` interface in FLIP-139, and the core logic of Pandas UDAF users
is written in the `accumulate` method. In this way, we can unify the
interface semantics of all UDAF.

What do you think?

Best,
Jincheng



Xingbo Huang <hx...@gmail.com> 于2020年8月31日周一 下午6:06写道:

> Hi Jincheng,
>
> Thanks a lot for joining the discussion and the suggestion of discussing
> FLIP-137 and FLIP-139 together.
>
> >> 1. We also need to consider how pandas UDAF supports metrics, and
> whether
> we need a custom interface for pandas UDAF?
>
> Yes. We need to add an interface so that users can add some logic in the
> `open` or `close` method such as creating metrics. I have added the
> definition of the interface and the corresponding example in the doc.
>
> >> 2. We have added @udaf(), so whether to use ordinary Python UDAF?
>
> Yes. From the overall view of Python User Defined Function, we use @udf to
> describe general python udf and pandas udf, @udtf to describe python udtf,
> and @udaf to describe general python udaf and pandas udaf, which is more
> unified. I will discuss it in FLIP-139 later.
>
> Best,
> Xingbo
>
> jincheng sun <su...@gmail.com> 于2020年8月31日周一 上午11:05写道:
>
> > Hi Xingbo,
> >
> > Thanks for the discussion! Overall, + 1 for this FLIP.
> > I have two points to add:
> >
> >  - We also need to consider how pandas UDAF supports metrics, and whether
> > we need a custom interface for pandas UDAF?
> >  - We have added @udaf(), so whether to use ordinary Python UDAF? If not,
> > the addition of @udaf is not appropriate. We need to discuss it further.
> >
> > We can consider it combination with FLIP-139 for design. What do you
> think?
> >
> > Best,
> > Jincheng
> >
> >
> > Xingbo Huang <hx...@gmail.com> 于2020年8月24日周一 下午2:25写道:
> >
> > > Hi everyone,
> > >
> > > I would like to start a discussion thread on "Support Pandas UDAF in
> > > PyFlink"
> > >
> > > Pandas UDF has been supported in FLINK 1.11 (FLIP-97[1]). It solves the
> > > high serialization/deserialization overhead in Python UDF and makes it
> > > convenient to leverage the popular Python libraries such as Pandas,
> > Numpy,
> > > etc. Since Pandas UDF has so many advantages, we want to support Pandas
> > > UDAF to extend usage of Pandas UDF.
> > >
> > > Dian Fu and I have discussed offline and have drafted the FLIP-137[2].
> It
> > > includes the following items:
> > >   - Support Pandas UDAF in Batch Group Aggregation
> > >   - Support Pandas UDAF in Batch Group Window Aggregation
> > >   - Support Pandas UDAF in Batch Over Window Aggregation
> > >   - Support Pandas UDAF in Stream Group Window Aggregation
> > >   - Support Pandas UDAF in Stream Bounded Over Window Aggregation
> > >
> > >
> > > Looking forward to your feedback!
> > >
> > > Best,
> > > Xingbo
> > >
> > > [1]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-97%3A+Support+Scalar+Vectorized+Python+UDF+in+PyFlink
> > > [2]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-137%3A+Support+Pandas+UDAF+in+PyFlink
> > >
> >
>

Re: [DISCUSS] FLIP-137: Support Pandas UDAF in PyFlink

Posted by Xingbo Huang <hx...@gmail.com>.
Hi everyone,

Thanks all of you for the discussion.
If there are no objections, I would like to start a vote thread tomorrow.

Best,
Xingbo

Dian Fu <di...@gmail.com> 于2020年9月3日周四 下午5:45写道:

> Thanks for preparing the FLIP, xingbo!
>
> LGTM overall and looking forward to the voting!
>
> Regards,
> Dian
>
> > 在 2020年9月3日,下午5:22,jincheng sun <su...@gmail.com> 写道:
> >
> > Thank you! looking forward to the voting :)
> >
> > Best,
> > Jincheng
> >
> >
> > Xingbo Huang <hx...@gmail.com> 于2020年9月3日周四 下午2:39写道:
> >
> >> Hi Jincheng,
> >>
> >> Yes, I agree that users can extend the class `AggregateFunction` if they
> >> want to define a Pandas UDAF by the way of custom classes. I have
> updated
> >> the part of the FLIP.
> >>
> >> Best,
> >> Xingbo
> >>
> >> jincheng sun <su...@gmail.com> 于2020年9月3日周四 下午1:48写道:
> >>
> >>> Thanks for the update Xingbo!
> >>>
> >>> Pandas UDAF can reuse the `class aggregate function (user defined
> >>> function)` interface in FLIP-139, and the core logic of Pandas UDAF
> users
> >>> is written in the `accumulate` method. In this way, we can unify the
> >>> interface semantics of all UDAF.
> >>>
> >>> What do you think?
> >>>
> >>> Best,
> >>> Jincheng
> >>>
> >>>
> >>>
> >>> Xingbo Huang <hx...@gmail.com> 于2020年8月31日周一 下午6:06写道:
> >>>
> >>>> Hi Jincheng,
> >>>>
> >>>> Thanks a lot for joining the discussion and the suggestion of
> >> discussing
> >>>> FLIP-137 and FLIP-139 together.
> >>>>
> >>>>>> 1. We also need to consider how pandas UDAF supports metrics, and
> >>>> whether
> >>>> we need a custom interface for pandas UDAF?
> >>>>
> >>>> Yes. We need to add an interface so that users can add some logic in
> >> the
> >>>> `open` or `close` method such as creating metrics. I have added the
> >>>> definition of the interface and the corresponding example in the doc.
> >>>>
> >>>>>> 2. We have added @udaf(), so whether to use ordinary Python UDAF?
> >>>>
> >>>> Yes. From the overall view of Python User Defined Function, we use
> @udf
> >>> to
> >>>> describe general python udf and pandas udf, @udtf to describe python
> >>> udtf,
> >>>> and @udaf to describe general python udaf and pandas udaf, which is
> >> more
> >>>> unified. I will discuss it in FLIP-139 later.
> >>>>
> >>>> Best,
> >>>> Xingbo
> >>>>
> >>>> jincheng sun <su...@gmail.com> 于2020年8月31日周一 上午11:05写道:
> >>>>
> >>>>> Hi Xingbo,
> >>>>>
> >>>>> Thanks for the discussion! Overall, + 1 for this FLIP.
> >>>>> I have two points to add:
> >>>>>
> >>>>> - We also need to consider how pandas UDAF supports metrics, and
> >>> whether
> >>>>> we need a custom interface for pandas UDAF?
> >>>>> - We have added @udaf(), so whether to use ordinary Python UDAF? If
> >>> not,
> >>>>> the addition of @udaf is not appropriate. We need to discuss it
> >>> further.
> >>>>>
> >>>>> We can consider it combination with FLIP-139 for design. What do you
> >>>> think?
> >>>>>
> >>>>> Best,
> >>>>> Jincheng
> >>>>>
> >>>>>
> >>>>> Xingbo Huang <hx...@gmail.com> 于2020年8月24日周一 下午2:25写道:
> >>>>>
> >>>>>> Hi everyone,
> >>>>>>
> >>>>>> I would like to start a discussion thread on "Support Pandas UDAF
> >> in
> >>>>>> PyFlink"
> >>>>>>
> >>>>>> Pandas UDF has been supported in FLINK 1.11 (FLIP-97[1]). It solves
> >>> the
> >>>>>> high serialization/deserialization overhead in Python UDF and makes
> >>> it
> >>>>>> convenient to leverage the popular Python libraries such as Pandas,
> >>>>> Numpy,
> >>>>>> etc. Since Pandas UDF has so many advantages, we want to support
> >>> Pandas
> >>>>>> UDAF to extend usage of Pandas UDF.
> >>>>>>
> >>>>>> Dian Fu and I have discussed offline and have drafted the
> >>> FLIP-137[2].
> >>>> It
> >>>>>> includes the following items:
> >>>>>>  - Support Pandas UDAF in Batch Group Aggregation
> >>>>>>  - Support Pandas UDAF in Batch Group Window Aggregation
> >>>>>>  - Support Pandas UDAF in Batch Over Window Aggregation
> >>>>>>  - Support Pandas UDAF in Stream Group Window Aggregation
> >>>>>>  - Support Pandas UDAF in Stream Bounded Over Window Aggregation
> >>>>>>
> >>>>>>
> >>>>>> Looking forward to your feedback!
> >>>>>>
> >>>>>> Best,
> >>>>>> Xingbo
> >>>>>>
> >>>>>> [1]
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-97%3A+Support+Scalar+Vectorized+Python+UDF+in+PyFlink
> >>>>>> [2]
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-137%3A+Support+Pandas+UDAF+in+PyFlink
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>

Re: [DISCUSS] FLIP-137: Support Pandas UDAF in PyFlink

Posted by Dian Fu <di...@gmail.com>.
Thanks for preparing the FLIP, xingbo!

LGTM overall and looking forward to the voting!

Regards,
Dian

> 在 2020年9月3日,下午5:22,jincheng sun <su...@gmail.com> 写道:
> 
> Thank you! looking forward to the voting :)
> 
> Best,
> Jincheng
> 
> 
> Xingbo Huang <hx...@gmail.com> 于2020年9月3日周四 下午2:39写道:
> 
>> Hi Jincheng,
>> 
>> Yes, I agree that users can extend the class `AggregateFunction` if they
>> want to define a Pandas UDAF by the way of custom classes. I have updated
>> the part of the FLIP.
>> 
>> Best,
>> Xingbo
>> 
>> jincheng sun <su...@gmail.com> 于2020年9月3日周四 下午1:48写道:
>> 
>>> Thanks for the update Xingbo!
>>> 
>>> Pandas UDAF can reuse the `class aggregate function (user defined
>>> function)` interface in FLIP-139, and the core logic of Pandas UDAF users
>>> is written in the `accumulate` method. In this way, we can unify the
>>> interface semantics of all UDAF.
>>> 
>>> What do you think?
>>> 
>>> Best,
>>> Jincheng
>>> 
>>> 
>>> 
>>> Xingbo Huang <hx...@gmail.com> 于2020年8月31日周一 下午6:06写道:
>>> 
>>>> Hi Jincheng,
>>>> 
>>>> Thanks a lot for joining the discussion and the suggestion of
>> discussing
>>>> FLIP-137 and FLIP-139 together.
>>>> 
>>>>>> 1. We also need to consider how pandas UDAF supports metrics, and
>>>> whether
>>>> we need a custom interface for pandas UDAF?
>>>> 
>>>> Yes. We need to add an interface so that users can add some logic in
>> the
>>>> `open` or `close` method such as creating metrics. I have added the
>>>> definition of the interface and the corresponding example in the doc.
>>>> 
>>>>>> 2. We have added @udaf(), so whether to use ordinary Python UDAF?
>>>> 
>>>> Yes. From the overall view of Python User Defined Function, we use @udf
>>> to
>>>> describe general python udf and pandas udf, @udtf to describe python
>>> udtf,
>>>> and @udaf to describe general python udaf and pandas udaf, which is
>> more
>>>> unified. I will discuss it in FLIP-139 later.
>>>> 
>>>> Best,
>>>> Xingbo
>>>> 
>>>> jincheng sun <su...@gmail.com> 于2020年8月31日周一 上午11:05写道:
>>>> 
>>>>> Hi Xingbo,
>>>>> 
>>>>> Thanks for the discussion! Overall, + 1 for this FLIP.
>>>>> I have two points to add:
>>>>> 
>>>>> - We also need to consider how pandas UDAF supports metrics, and
>>> whether
>>>>> we need a custom interface for pandas UDAF?
>>>>> - We have added @udaf(), so whether to use ordinary Python UDAF? If
>>> not,
>>>>> the addition of @udaf is not appropriate. We need to discuss it
>>> further.
>>>>> 
>>>>> We can consider it combination with FLIP-139 for design. What do you
>>>> think?
>>>>> 
>>>>> Best,
>>>>> Jincheng
>>>>> 
>>>>> 
>>>>> Xingbo Huang <hx...@gmail.com> 于2020年8月24日周一 下午2:25写道:
>>>>> 
>>>>>> Hi everyone,
>>>>>> 
>>>>>> I would like to start a discussion thread on "Support Pandas UDAF
>> in
>>>>>> PyFlink"
>>>>>> 
>>>>>> Pandas UDF has been supported in FLINK 1.11 (FLIP-97[1]). It solves
>>> the
>>>>>> high serialization/deserialization overhead in Python UDF and makes
>>> it
>>>>>> convenient to leverage the popular Python libraries such as Pandas,
>>>>> Numpy,
>>>>>> etc. Since Pandas UDF has so many advantages, we want to support
>>> Pandas
>>>>>> UDAF to extend usage of Pandas UDF.
>>>>>> 
>>>>>> Dian Fu and I have discussed offline and have drafted the
>>> FLIP-137[2].
>>>> It
>>>>>> includes the following items:
>>>>>>  - Support Pandas UDAF in Batch Group Aggregation
>>>>>>  - Support Pandas UDAF in Batch Group Window Aggregation
>>>>>>  - Support Pandas UDAF in Batch Over Window Aggregation
>>>>>>  - Support Pandas UDAF in Stream Group Window Aggregation
>>>>>>  - Support Pandas UDAF in Stream Bounded Over Window Aggregation
>>>>>> 
>>>>>> 
>>>>>> Looking forward to your feedback!
>>>>>> 
>>>>>> Best,
>>>>>> Xingbo
>>>>>> 
>>>>>> [1]
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-97%3A+Support+Scalar+Vectorized+Python+UDF+in+PyFlink
>>>>>> [2]
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-137%3A+Support+Pandas+UDAF+in+PyFlink
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 


Re: [DISCUSS] FLIP-137: Support Pandas UDAF in PyFlink

Posted by jincheng sun <su...@gmail.com>.
Thank you! looking forward to the voting :)

Best,
Jincheng


Xingbo Huang <hx...@gmail.com> 于2020年9月3日周四 下午2:39写道:

> Hi Jincheng,
>
> Yes, I agree that users can extend the class `AggregateFunction` if they
> want to define a Pandas UDAF by the way of custom classes. I have updated
> the part of the FLIP.
>
> Best,
> Xingbo
>
> jincheng sun <su...@gmail.com> 于2020年9月3日周四 下午1:48写道:
>
> > Thanks for the update Xingbo!
> >
> > Pandas UDAF can reuse the `class aggregate function (user defined
> > function)` interface in FLIP-139, and the core logic of Pandas UDAF users
> > is written in the `accumulate` method. In this way, we can unify the
> > interface semantics of all UDAF.
> >
> > What do you think?
> >
> > Best,
> > Jincheng
> >
> >
> >
> > Xingbo Huang <hx...@gmail.com> 于2020年8月31日周一 下午6:06写道:
> >
> > > Hi Jincheng,
> > >
> > > Thanks a lot for joining the discussion and the suggestion of
> discussing
> > > FLIP-137 and FLIP-139 together.
> > >
> > > >> 1. We also need to consider how pandas UDAF supports metrics, and
> > > whether
> > > we need a custom interface for pandas UDAF?
> > >
> > > Yes. We need to add an interface so that users can add some logic in
> the
> > > `open` or `close` method such as creating metrics. I have added the
> > > definition of the interface and the corresponding example in the doc.
> > >
> > > >> 2. We have added @udaf(), so whether to use ordinary Python UDAF?
> > >
> > > Yes. From the overall view of Python User Defined Function, we use @udf
> > to
> > > describe general python udf and pandas udf, @udtf to describe python
> > udtf,
> > > and @udaf to describe general python udaf and pandas udaf, which is
> more
> > > unified. I will discuss it in FLIP-139 later.
> > >
> > > Best,
> > > Xingbo
> > >
> > > jincheng sun <su...@gmail.com> 于2020年8月31日周一 上午11:05写道:
> > >
> > > > Hi Xingbo,
> > > >
> > > > Thanks for the discussion! Overall, + 1 for this FLIP.
> > > > I have two points to add:
> > > >
> > > >  - We also need to consider how pandas UDAF supports metrics, and
> > whether
> > > > we need a custom interface for pandas UDAF?
> > > >  - We have added @udaf(), so whether to use ordinary Python UDAF? If
> > not,
> > > > the addition of @udaf is not appropriate. We need to discuss it
> > further.
> > > >
> > > > We can consider it combination with FLIP-139 for design. What do you
> > > think?
> > > >
> > > > Best,
> > > > Jincheng
> > > >
> > > >
> > > > Xingbo Huang <hx...@gmail.com> 于2020年8月24日周一 下午2:25写道:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > I would like to start a discussion thread on "Support Pandas UDAF
> in
> > > > > PyFlink"
> > > > >
> > > > > Pandas UDF has been supported in FLINK 1.11 (FLIP-97[1]). It solves
> > the
> > > > > high serialization/deserialization overhead in Python UDF and makes
> > it
> > > > > convenient to leverage the popular Python libraries such as Pandas,
> > > > Numpy,
> > > > > etc. Since Pandas UDF has so many advantages, we want to support
> > Pandas
> > > > > UDAF to extend usage of Pandas UDF.
> > > > >
> > > > > Dian Fu and I have discussed offline and have drafted the
> > FLIP-137[2].
> > > It
> > > > > includes the following items:
> > > > >   - Support Pandas UDAF in Batch Group Aggregation
> > > > >   - Support Pandas UDAF in Batch Group Window Aggregation
> > > > >   - Support Pandas UDAF in Batch Over Window Aggregation
> > > > >   - Support Pandas UDAF in Stream Group Window Aggregation
> > > > >   - Support Pandas UDAF in Stream Bounded Over Window Aggregation
> > > > >
> > > > >
> > > > > Looking forward to your feedback!
> > > > >
> > > > > Best,
> > > > > Xingbo
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-97%3A+Support+Scalar+Vectorized+Python+UDF+in+PyFlink
> > > > > [2]
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-137%3A+Support+Pandas+UDAF+in+PyFlink
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-137: Support Pandas UDAF in PyFlink

Posted by Xingbo Huang <hx...@gmail.com>.
Hi Jincheng,

Yes, I agree that users can extend the class `AggregateFunction` if they
want to define a Pandas UDAF by the way of custom classes. I have updated
the part of the FLIP.

Best,
Xingbo

jincheng sun <su...@gmail.com> 于2020年9月3日周四 下午1:48写道:

> Thanks for the update Xingbo!
>
> Pandas UDAF can reuse the `class aggregate function (user defined
> function)` interface in FLIP-139, and the core logic of Pandas UDAF users
> is written in the `accumulate` method. In this way, we can unify the
> interface semantics of all UDAF.
>
> What do you think?
>
> Best,
> Jincheng
>
>
>
> Xingbo Huang <hx...@gmail.com> 于2020年8月31日周一 下午6:06写道:
>
> > Hi Jincheng,
> >
> > Thanks a lot for joining the discussion and the suggestion of discussing
> > FLIP-137 and FLIP-139 together.
> >
> > >> 1. We also need to consider how pandas UDAF supports metrics, and
> > whether
> > we need a custom interface for pandas UDAF?
> >
> > Yes. We need to add an interface so that users can add some logic in the
> > `open` or `close` method such as creating metrics. I have added the
> > definition of the interface and the corresponding example in the doc.
> >
> > >> 2. We have added @udaf(), so whether to use ordinary Python UDAF?
> >
> > Yes. From the overall view of Python User Defined Function, we use @udf
> to
> > describe general python udf and pandas udf, @udtf to describe python
> udtf,
> > and @udaf to describe general python udaf and pandas udaf, which is more
> > unified. I will discuss it in FLIP-139 later.
> >
> > Best,
> > Xingbo
> >
> > jincheng sun <su...@gmail.com> 于2020年8月31日周一 上午11:05写道:
> >
> > > Hi Xingbo,
> > >
> > > Thanks for the discussion! Overall, + 1 for this FLIP.
> > > I have two points to add:
> > >
> > >  - We also need to consider how pandas UDAF supports metrics, and
> whether
> > > we need a custom interface for pandas UDAF?
> > >  - We have added @udaf(), so whether to use ordinary Python UDAF? If
> not,
> > > the addition of @udaf is not appropriate. We need to discuss it
> further.
> > >
> > > We can consider it combination with FLIP-139 for design. What do you
> > think?
> > >
> > > Best,
> > > Jincheng
> > >
> > >
> > > Xingbo Huang <hx...@gmail.com> 于2020年8月24日周一 下午2:25写道:
> > >
> > > > Hi everyone,
> > > >
> > > > I would like to start a discussion thread on "Support Pandas UDAF in
> > > > PyFlink"
> > > >
> > > > Pandas UDF has been supported in FLINK 1.11 (FLIP-97[1]). It solves
> the
> > > > high serialization/deserialization overhead in Python UDF and makes
> it
> > > > convenient to leverage the popular Python libraries such as Pandas,
> > > Numpy,
> > > > etc. Since Pandas UDF has so many advantages, we want to support
> Pandas
> > > > UDAF to extend usage of Pandas UDF.
> > > >
> > > > Dian Fu and I have discussed offline and have drafted the
> FLIP-137[2].
> > It
> > > > includes the following items:
> > > >   - Support Pandas UDAF in Batch Group Aggregation
> > > >   - Support Pandas UDAF in Batch Group Window Aggregation
> > > >   - Support Pandas UDAF in Batch Over Window Aggregation
> > > >   - Support Pandas UDAF in Stream Group Window Aggregation
> > > >   - Support Pandas UDAF in Stream Bounded Over Window Aggregation
> > > >
> > > >
> > > > Looking forward to your feedback!
> > > >
> > > > Best,
> > > > Xingbo
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-97%3A+Support+Scalar+Vectorized+Python+UDF+in+PyFlink
> > > > [2]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-137%3A+Support+Pandas+UDAF+in+PyFlink
> > > >
> > >
> >
>