You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flink.apache.org by Xingbo Huang <hx...@gmail.com> on 2020/08/24 06:24:31 UTC

[DISCUSS] FLIP-137: Support Pandas UDAF in PyFlink

Hi everyone,

I would like to start a discussion thread on "Support Pandas UDAF in
PyFlink"

Pandas UDF has been supported in FLINK 1.11 (FLIP-97[1]). It solves the
high serialization/deserialization overhead in Python UDF and makes it
convenient to leverage the popular Python libraries such as Pandas, Numpy,
etc. Since Pandas UDF has so many advantages, we want to support Pandas
UDAF to extend usage of Pandas UDF.

Dian Fu and I have discussed offline and have drafted the FLIP-137[2]. It
includes the following items:
  - Support Pandas UDAF in Batch Group Aggregation
  - Support Pandas UDAF in Batch Group Window Aggregation
  - Support Pandas UDAF in Batch Over Window Aggregation
  - Support Pandas UDAF in Stream Group Window Aggregation
  - Support Pandas UDAF in Stream Bounded Over Window Aggregation


Looking forward to your feedback!

Best,
Xingbo

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-97%3A+Support+Scalar+Vectorized+Python+UDF+in+PyFlink
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-137%3A+Support+Pandas+UDAF+in+PyFlink

Re: [DISCUSS] FLIP-137: Support Pandas UDAF in PyFlink

Posted by Xingbo Huang <hx...@gmail.com>.
Hi everyone,

Thanks all of you for the discussion.
If there are no objections, I would like to start a vote thread tomorrow.

Best,
Xingbo

Dian Fu <di...@gmail.com> 于2020年9月3日周四 下午5:45写道:

> Thanks for preparing the FLIP, xingbo!
>
> LGTM overall and looking forward to the voting!
>
> Regards,
> Dian
>
> > 在 2020年9月3日,下午5:22,jincheng sun <su...@gmail.com> 写道:
> >
> > Thank you! looking forward to the voting :)
> >
> > Best,
> > Jincheng
> >
> >
> > Xingbo Huang <hx...@gmail.com> 于2020年9月3日周四 下午2:39写道:
> >
> >> Hi Jincheng,
> >>
> >> Yes, I agree that users can extend the class `AggregateFunction` if they
> >> want to define a Pandas UDAF by the way of custom classes. I have
> updated
> >> the part of the FLIP.
> >>
> >> Best,
> >> Xingbo
> >>
> >> jincheng sun <su...@gmail.com> 于2020年9月3日周四 下午1:48写道:
> >>
> >>> Thanks for the update Xingbo!
> >>>
> >>> Pandas UDAF can reuse the `class aggregate function (user defined
> >>> function)` interface in FLIP-139, and the core logic of Pandas UDAF
> users
> >>> is written in the `accumulate` method. In this way, we can unify the
> >>> interface semantics of all UDAF.
> >>>
> >>> What do you think?
> >>>
> >>> Best,
> >>> Jincheng
> >>>
> >>>
> >>>
> >>> Xingbo Huang <hx...@gmail.com> 于2020年8月31日周一 下午6:06写道:
> >>>
> >>>> Hi Jincheng,
> >>>>
> >>>> Thanks a lot for joining the discussion and the suggestion of
> >> discussing
> >>>> FLIP-137 and FLIP-139 together.
> >>>>
> >>>>>> 1. We also need to consider how pandas UDAF supports metrics, and
> >>>> whether
> >>>> we need a custom interface for pandas UDAF?
> >>>>
> >>>> Yes. We need to add an interface so that users can add some logic in
> >> the
> >>>> `open` or `close` method such as creating metrics. I have added the
> >>>> definition of the interface and the corresponding example in the doc.
> >>>>
> >>>>>> 2. We have added @udaf(), so whether to use ordinary Python UDAF?
> >>>>
> >>>> Yes. From the overall view of Python User Defined Function, we use
> @udf
> >>> to
> >>>> describe general python udf and pandas udf, @udtf to describe python
> >>> udtf,
> >>>> and @udaf to describe general python udaf and pandas udaf, which is
> >> more
> >>>> unified. I will discuss it in FLIP-139 later.
> >>>>
> >>>> Best,
> >>>> Xingbo
> >>>>
> >>>> jincheng sun <su...@gmail.com> 于2020年8月31日周一 上午11:05写道:
> >>>>
> >>>>> Hi Xingbo,
> >>>>>
> >>>>> Thanks for the discussion! Overall, + 1 for this FLIP.
> >>>>> I have two points to add:
> >>>>>
> >>>>> - We also need to consider how pandas UDAF supports metrics, and
> >>> whether
> >>>>> we need a custom interface for pandas UDAF?
> >>>>> - We have added @udaf(), so whether to use ordinary Python UDAF? If
> >>> not,
> >>>>> the addition of @udaf is not appropriate. We need to discuss it
> >>> further.
> >>>>>
> >>>>> We can consider it combination with FLIP-139 for design. What do you
> >>>> think?
> >>>>>
> >>>>> Best,
> >>>>> Jincheng
> >>>>>
> >>>>>
> >>>>> Xingbo Huang <hx...@gmail.com> 于2020年8月24日周一 下午2:25写道:
> >>>>>
> >>>>>> Hi everyone,
> >>>>>>
> >>>>>> I would like to start a discussion thread on "Support Pandas UDAF
> >> in
> >>>>>> PyFlink"
> >>>>>>
> >>>>>> Pandas UDF has been supported in FLINK 1.11 (FLIP-97[1]). It solves
> >>> the
> >>>>>> high serialization/deserialization overhead in Python UDF and makes
> >>> it
> >>>>>> convenient to leverage the popular Python libraries such as Pandas,
> >>>>> Numpy,
> >>>>>> etc. Since Pandas UDF has so many advantages, we want to support
> >>> Pandas
> >>>>>> UDAF to extend usage of Pandas UDF.
> >>>>>>
> >>>>>> Dian Fu and I have discussed offline and have drafted the
> >>> FLIP-137[2].
> >>>> It
> >>>>>> includes the following items:
> >>>>>>  - Support Pandas UDAF in Batch Group Aggregation
> >>>>>>  - Support Pandas UDAF in Batch Group Window Aggregation
> >>>>>>  - Support Pandas UDAF in Batch Over Window Aggregation
> >>>>>>  - Support Pandas UDAF in Stream Group Window Aggregation
> >>>>>>  - Support Pandas UDAF in Stream Bounded Over Window Aggregation
> >>>>>>
> >>>>>>
> >>>>>> Looking forward to your feedback!
> >>>>>>
> >>>>>> Best,
> >>>>>> Xingbo
> >>>>>>
> >>>>>> [1]
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-97%3A+Support+Scalar+Vectorized+Python+UDF+in+PyFlink
> >>>>>> [2]
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-137%3A+Support+Pandas+UDAF+in+PyFlink
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>

Re: [DISCUSS] FLIP-137: Support Pandas UDAF in PyFlink

Posted by Dian Fu <di...@gmail.com>.
Thanks for preparing the FLIP, xingbo!

LGTM overall and looking forward to the voting!

Regards,
Dian

> 在 2020年9月3日,下午5:22,jincheng sun <su...@gmail.com> 写道:
> 
> Thank you! looking forward to the voting :)
> 
> Best,
> Jincheng
> 
> 
> Xingbo Huang <hx...@gmail.com> 于2020年9月3日周四 下午2:39写道:
> 
>> Hi Jincheng,
>> 
>> Yes, I agree that users can extend the class `AggregateFunction` if they
>> want to define a Pandas UDAF by the way of custom classes. I have updated
>> the part of the FLIP.
>> 
>> Best,
>> Xingbo
>> 
>> jincheng sun <su...@gmail.com> 于2020年9月3日周四 下午1:48写道:
>> 
>>> Thanks for the update Xingbo!
>>> 
>>> Pandas UDAF can reuse the `class aggregate function (user defined
>>> function)` interface in FLIP-139, and the core logic of Pandas UDAF users
>>> is written in the `accumulate` method. In this way, we can unify the
>>> interface semantics of all UDAF.
>>> 
>>> What do you think?
>>> 
>>> Best,
>>> Jincheng
>>> 
>>> 
>>> 
>>> Xingbo Huang <hx...@gmail.com> 于2020年8月31日周一 下午6:06写道:
>>> 
>>>> Hi Jincheng,
>>>> 
>>>> Thanks a lot for joining the discussion and the suggestion of
>> discussing
>>>> FLIP-137 and FLIP-139 together.
>>>> 
>>>>>> 1. We also need to consider how pandas UDAF supports metrics, and
>>>> whether
>>>> we need a custom interface for pandas UDAF?
>>>> 
>>>> Yes. We need to add an interface so that users can add some logic in
>> the
>>>> `open` or `close` method such as creating metrics. I have added the
>>>> definition of the interface and the corresponding example in the doc.
>>>> 
>>>>>> 2. We have added @udaf(), so whether to use ordinary Python UDAF?
>>>> 
>>>> Yes. From the overall view of Python User Defined Function, we use @udf
>>> to
>>>> describe general python udf and pandas udf, @udtf to describe python
>>> udtf,
>>>> and @udaf to describe general python udaf and pandas udaf, which is
>> more
>>>> unified. I will discuss it in FLIP-139 later.
>>>> 
>>>> Best,
>>>> Xingbo
>>>> 
>>>> jincheng sun <su...@gmail.com> 于2020年8月31日周一 上午11:05写道:
>>>> 
>>>>> Hi Xingbo,
>>>>> 
>>>>> Thanks for the discussion! Overall, + 1 for this FLIP.
>>>>> I have two points to add:
>>>>> 
>>>>> - We also need to consider how pandas UDAF supports metrics, and
>>> whether
>>>>> we need a custom interface for pandas UDAF?
>>>>> - We have added @udaf(), so whether to use ordinary Python UDAF? If
>>> not,
>>>>> the addition of @udaf is not appropriate. We need to discuss it
>>> further.
>>>>> 
>>>>> We can consider it combination with FLIP-139 for design. What do you
>>>> think?
>>>>> 
>>>>> Best,
>>>>> Jincheng
>>>>> 
>>>>> 
>>>>> Xingbo Huang <hx...@gmail.com> 于2020年8月24日周一 下午2:25写道:
>>>>> 
>>>>>> Hi everyone,
>>>>>> 
>>>>>> I would like to start a discussion thread on "Support Pandas UDAF
>> in
>>>>>> PyFlink"
>>>>>> 
>>>>>> Pandas UDF has been supported in FLINK 1.11 (FLIP-97[1]). It solves
>>> the
>>>>>> high serialization/deserialization overhead in Python UDF and makes
>>> it
>>>>>> convenient to leverage the popular Python libraries such as Pandas,
>>>>> Numpy,
>>>>>> etc. Since Pandas UDF has so many advantages, we want to support
>>> Pandas
>>>>>> UDAF to extend usage of Pandas UDF.
>>>>>> 
>>>>>> Dian Fu and I have discussed offline and have drafted the
>>> FLIP-137[2].
>>>> It
>>>>>> includes the following items:
>>>>>>  - Support Pandas UDAF in Batch Group Aggregation
>>>>>>  - Support Pandas UDAF in Batch Group Window Aggregation
>>>>>>  - Support Pandas UDAF in Batch Over Window Aggregation
>>>>>>  - Support Pandas UDAF in Stream Group Window Aggregation
>>>>>>  - Support Pandas UDAF in Stream Bounded Over Window Aggregation
>>>>>> 
>>>>>> 
>>>>>> Looking forward to your feedback!
>>>>>> 
>>>>>> Best,
>>>>>> Xingbo
>>>>>> 
>>>>>> [1]
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-97%3A+Support+Scalar+Vectorized+Python+UDF+in+PyFlink
>>>>>> [2]
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-137%3A+Support+Pandas+UDAF+in+PyFlink
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 


Re: [DISCUSS] FLIP-137: Support Pandas UDAF in PyFlink

Posted by jincheng sun <su...@gmail.com>.
Thank you! looking forward to the voting :)

Best,
Jincheng


Xingbo Huang <hx...@gmail.com> 于2020年9月3日周四 下午2:39写道:

> Hi Jincheng,
>
> Yes, I agree that users can extend the class `AggregateFunction` if they
> want to define a Pandas UDAF by the way of custom classes. I have updated
> the part of the FLIP.
>
> Best,
> Xingbo
>
> jincheng sun <su...@gmail.com> 于2020年9月3日周四 下午1:48写道:
>
> > Thanks for the update Xingbo!
> >
> > Pandas UDAF can reuse the `class aggregate function (user defined
> > function)` interface in FLIP-139, and the core logic of Pandas UDAF users
> > is written in the `accumulate` method. In this way, we can unify the
> > interface semantics of all UDAF.
> >
> > What do you think?
> >
> > Best,
> > Jincheng
> >
> >
> >
> > Xingbo Huang <hx...@gmail.com> 于2020年8月31日周一 下午6:06写道:
> >
> > > Hi Jincheng,
> > >
> > > Thanks a lot for joining the discussion and the suggestion of
> discussing
> > > FLIP-137 and FLIP-139 together.
> > >
> > > >> 1. We also need to consider how pandas UDAF supports metrics, and
> > > whether
> > > we need a custom interface for pandas UDAF?
> > >
> > > Yes. We need to add an interface so that users can add some logic in
> the
> > > `open` or `close` method such as creating metrics. I have added the
> > > definition of the interface and the corresponding example in the doc.
> > >
> > > >> 2. We have added @udaf(), so whether to use ordinary Python UDAF?
> > >
> > > Yes. From the overall view of Python User Defined Function, we use @udf
> > to
> > > describe general python udf and pandas udf, @udtf to describe python
> > udtf,
> > > and @udaf to describe general python udaf and pandas udaf, which is
> more
> > > unified. I will discuss it in FLIP-139 later.
> > >
> > > Best,
> > > Xingbo
> > >
> > > jincheng sun <su...@gmail.com> 于2020年8月31日周一 上午11:05写道:
> > >
> > > > Hi Xingbo,
> > > >
> > > > Thanks for the discussion! Overall, + 1 for this FLIP.
> > > > I have two points to add:
> > > >
> > > >  - We also need to consider how pandas UDAF supports metrics, and
> > whether
> > > > we need a custom interface for pandas UDAF?
> > > >  - We have added @udaf(), so whether to use ordinary Python UDAF? If
> > not,
> > > > the addition of @udaf is not appropriate. We need to discuss it
> > further.
> > > >
> > > > We can consider it combination with FLIP-139 for design. What do you
> > > think?
> > > >
> > > > Best,
> > > > Jincheng
> > > >
> > > >
> > > > Xingbo Huang <hx...@gmail.com> 于2020年8月24日周一 下午2:25写道:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > I would like to start a discussion thread on "Support Pandas UDAF
> in
> > > > > PyFlink"
> > > > >
> > > > > Pandas UDF has been supported in FLINK 1.11 (FLIP-97[1]). It solves
> > the
> > > > > high serialization/deserialization overhead in Python UDF and makes
> > it
> > > > > convenient to leverage the popular Python libraries such as Pandas,
> > > > Numpy,
> > > > > etc. Since Pandas UDF has so many advantages, we want to support
> > Pandas
> > > > > UDAF to extend usage of Pandas UDF.
> > > > >
> > > > > Dian Fu and I have discussed offline and have drafted the
> > FLIP-137[2].
> > > It
> > > > > includes the following items:
> > > > >   - Support Pandas UDAF in Batch Group Aggregation
> > > > >   - Support Pandas UDAF in Batch Group Window Aggregation
> > > > >   - Support Pandas UDAF in Batch Over Window Aggregation
> > > > >   - Support Pandas UDAF in Stream Group Window Aggregation
> > > > >   - Support Pandas UDAF in Stream Bounded Over Window Aggregation
> > > > >
> > > > >
> > > > > Looking forward to your feedback!
> > > > >
> > > > > Best,
> > > > > Xingbo
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-97%3A+Support+Scalar+Vectorized+Python+UDF+in+PyFlink
> > > > > [2]
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-137%3A+Support+Pandas+UDAF+in+PyFlink
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-137: Support Pandas UDAF in PyFlink

Posted by Xingbo Huang <hx...@gmail.com>.
Hi Jincheng,

Yes, I agree that users can extend the class `AggregateFunction` if they
want to define a Pandas UDAF by the way of custom classes. I have updated
the part of the FLIP.

Best,
Xingbo

jincheng sun <su...@gmail.com> 于2020年9月3日周四 下午1:48写道:

> Thanks for the update Xingbo!
>
> Pandas UDAF can reuse the `class aggregate function (user defined
> function)` interface in FLIP-139, and the core logic of Pandas UDAF users
> is written in the `accumulate` method. In this way, we can unify the
> interface semantics of all UDAF.
>
> What do you think?
>
> Best,
> Jincheng
>
>
>
> Xingbo Huang <hx...@gmail.com> 于2020年8月31日周一 下午6:06写道:
>
> > Hi Jincheng,
> >
> > Thanks a lot for joining the discussion and the suggestion of discussing
> > FLIP-137 and FLIP-139 together.
> >
> > >> 1. We also need to consider how pandas UDAF supports metrics, and
> > whether
> > we need a custom interface for pandas UDAF?
> >
> > Yes. We need to add an interface so that users can add some logic in the
> > `open` or `close` method such as creating metrics. I have added the
> > definition of the interface and the corresponding example in the doc.
> >
> > >> 2. We have added @udaf(), so whether to use ordinary Python UDAF?
> >
> > Yes. From the overall view of Python User Defined Function, we use @udf
> to
> > describe general python udf and pandas udf, @udtf to describe python
> udtf,
> > and @udaf to describe general python udaf and pandas udaf, which is more
> > unified. I will discuss it in FLIP-139 later.
> >
> > Best,
> > Xingbo
> >
> > jincheng sun <su...@gmail.com> 于2020年8月31日周一 上午11:05写道:
> >
> > > Hi Xingbo,
> > >
> > > Thanks for the discussion! Overall, + 1 for this FLIP.
> > > I have two points to add:
> > >
> > >  - We also need to consider how pandas UDAF supports metrics, and
> whether
> > > we need a custom interface for pandas UDAF?
> > >  - We have added @udaf(), so whether to use ordinary Python UDAF? If
> not,
> > > the addition of @udaf is not appropriate. We need to discuss it
> further.
> > >
> > > We can consider it combination with FLIP-139 for design. What do you
> > think?
> > >
> > > Best,
> > > Jincheng
> > >
> > >
> > > Xingbo Huang <hx...@gmail.com> 于2020年8月24日周一 下午2:25写道:
> > >
> > > > Hi everyone,
> > > >
> > > > I would like to start a discussion thread on "Support Pandas UDAF in
> > > > PyFlink"
> > > >
> > > > Pandas UDF has been supported in FLINK 1.11 (FLIP-97[1]). It solves
> the
> > > > high serialization/deserialization overhead in Python UDF and makes
> it
> > > > convenient to leverage the popular Python libraries such as Pandas,
> > > Numpy,
> > > > etc. Since Pandas UDF has so many advantages, we want to support
> Pandas
> > > > UDAF to extend usage of Pandas UDF.
> > > >
> > > > Dian Fu and I have discussed offline and have drafted the
> FLIP-137[2].
> > It
> > > > includes the following items:
> > > >   - Support Pandas UDAF in Batch Group Aggregation
> > > >   - Support Pandas UDAF in Batch Group Window Aggregation
> > > >   - Support Pandas UDAF in Batch Over Window Aggregation
> > > >   - Support Pandas UDAF in Stream Group Window Aggregation
> > > >   - Support Pandas UDAF in Stream Bounded Over Window Aggregation
> > > >
> > > >
> > > > Looking forward to your feedback!
> > > >
> > > > Best,
> > > > Xingbo
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-97%3A+Support+Scalar+Vectorized+Python+UDF+in+PyFlink
> > > > [2]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-137%3A+Support+Pandas+UDAF+in+PyFlink
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-137: Support Pandas UDAF in PyFlink

Posted by jincheng sun <su...@gmail.com>.
Thanks for the update Xingbo!

Pandas UDAF can reuse the `class aggregate function (user defined
function)` interface in FLIP-139, and the core logic of Pandas UDAF users
is written in the `accumulate` method. In this way, we can unify the
interface semantics of all UDAF.

What do you think?

Best,
Jincheng



Xingbo Huang <hx...@gmail.com> 于2020年8月31日周一 下午6:06写道:

> Hi Jincheng,
>
> Thanks a lot for joining the discussion and the suggestion of discussing
> FLIP-137 and FLIP-139 together.
>
> >> 1. We also need to consider how pandas UDAF supports metrics, and
> whether
> we need a custom interface for pandas UDAF?
>
> Yes. We need to add an interface so that users can add some logic in the
> `open` or `close` method such as creating metrics. I have added the
> definition of the interface and the corresponding example in the doc.
>
> >> 2. We have added @udaf(), so whether to use ordinary Python UDAF?
>
> Yes. From the overall view of Python User Defined Function, we use @udf to
> describe general python udf and pandas udf, @udtf to describe python udtf,
> and @udaf to describe general python udaf and pandas udaf, which is more
> unified. I will discuss it in FLIP-139 later.
>
> Best,
> Xingbo
>
> jincheng sun <su...@gmail.com> 于2020年8月31日周一 上午11:05写道:
>
> > Hi Xingbo,
> >
> > Thanks for the discussion! Overall, + 1 for this FLIP.
> > I have two points to add:
> >
> >  - We also need to consider how pandas UDAF supports metrics, and whether
> > we need a custom interface for pandas UDAF?
> >  - We have added @udaf(), so whether to use ordinary Python UDAF? If not,
> > the addition of @udaf is not appropriate. We need to discuss it further.
> >
> > We can consider it combination with FLIP-139 for design. What do you
> think?
> >
> > Best,
> > Jincheng
> >
> >
> > Xingbo Huang <hx...@gmail.com> 于2020年8月24日周一 下午2:25写道:
> >
> > > Hi everyone,
> > >
> > > I would like to start a discussion thread on "Support Pandas UDAF in
> > > PyFlink"
> > >
> > > Pandas UDF has been supported in FLINK 1.11 (FLIP-97[1]). It solves the
> > > high serialization/deserialization overhead in Python UDF and makes it
> > > convenient to leverage the popular Python libraries such as Pandas,
> > Numpy,
> > > etc. Since Pandas UDF has so many advantages, we want to support Pandas
> > > UDAF to extend usage of Pandas UDF.
> > >
> > > Dian Fu and I have discussed offline and have drafted the FLIP-137[2].
> It
> > > includes the following items:
> > >   - Support Pandas UDAF in Batch Group Aggregation
> > >   - Support Pandas UDAF in Batch Group Window Aggregation
> > >   - Support Pandas UDAF in Batch Over Window Aggregation
> > >   - Support Pandas UDAF in Stream Group Window Aggregation
> > >   - Support Pandas UDAF in Stream Bounded Over Window Aggregation
> > >
> > >
> > > Looking forward to your feedback!
> > >
> > > Best,
> > > Xingbo
> > >
> > > [1]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-97%3A+Support+Scalar+Vectorized+Python+UDF+in+PyFlink
> > > [2]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-137%3A+Support+Pandas+UDAF+in+PyFlink
> > >
> >
>

Re: [DISCUSS] FLIP-137: Support Pandas UDAF in PyFlink

Posted by Xingbo Huang <hx...@gmail.com>.
Hi Jincheng,

Thanks a lot for joining the discussion and the suggestion of discussing
FLIP-137 and FLIP-139 together.

>> 1. We also need to consider how pandas UDAF supports metrics, and whether
we need a custom interface for pandas UDAF?

Yes. We need to add an interface so that users can add some logic in the
`open` or `close` method such as creating metrics. I have added the
definition of the interface and the corresponding example in the doc.

>> 2. We have added @udaf(), so whether to use ordinary Python UDAF?

Yes. From the overall view of Python User Defined Function, we use @udf to
describe general python udf and pandas udf, @udtf to describe python udtf,
and @udaf to describe general python udaf and pandas udaf, which is more
unified. I will discuss it in FLIP-139 later.

Best,
Xingbo

jincheng sun <su...@gmail.com> 于2020年8月31日周一 上午11:05写道:

> Hi Xingbo,
>
> Thanks for the discussion! Overall, + 1 for this FLIP.
> I have two points to add:
>
>  - We also need to consider how pandas UDAF supports metrics, and whether
> we need a custom interface for pandas UDAF?
>  - We have added @udaf(), so whether to use ordinary Python UDAF? If not,
> the addition of @udaf is not appropriate. We need to discuss it further.
>
> We can consider it combination with FLIP-139 for design. What do you think?
>
> Best,
> Jincheng
>
>
> Xingbo Huang <hx...@gmail.com> 于2020年8月24日周一 下午2:25写道:
>
> > Hi everyone,
> >
> > I would like to start a discussion thread on "Support Pandas UDAF in
> > PyFlink"
> >
> > Pandas UDF has been supported in FLINK 1.11 (FLIP-97[1]). It solves the
> > high serialization/deserialization overhead in Python UDF and makes it
> > convenient to leverage the popular Python libraries such as Pandas,
> Numpy,
> > etc. Since Pandas UDF has so many advantages, we want to support Pandas
> > UDAF to extend usage of Pandas UDF.
> >
> > Dian Fu and I have discussed offline and have drafted the FLIP-137[2]. It
> > includes the following items:
> >   - Support Pandas UDAF in Batch Group Aggregation
> >   - Support Pandas UDAF in Batch Group Window Aggregation
> >   - Support Pandas UDAF in Batch Over Window Aggregation
> >   - Support Pandas UDAF in Stream Group Window Aggregation
> >   - Support Pandas UDAF in Stream Bounded Over Window Aggregation
> >
> >
> > Looking forward to your feedback!
> >
> > Best,
> > Xingbo
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-97%3A+Support+Scalar+Vectorized+Python+UDF+in+PyFlink
> > [2]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-137%3A+Support+Pandas+UDAF+in+PyFlink
> >
>

Re: [DISCUSS] FLIP-137: Support Pandas UDAF in PyFlink

Posted by jincheng sun <su...@gmail.com>.
Hi Xingbo,

Thanks for the discussion! Overall, + 1 for this FLIP.
I have two points to add:

 - We also need to consider how pandas UDAF supports metrics, and whether
we need a custom interface for pandas UDAF?
 - We have added @udaf(), so whether to use ordinary Python UDAF? If not,
the addition of @udaf is not appropriate. We need to discuss it further.

We can consider it combination with FLIP-139 for design. What do you think?

Best,
Jincheng


Xingbo Huang <hx...@gmail.com> 于2020年8月24日周一 下午2:25写道:

> Hi everyone,
>
> I would like to start a discussion thread on "Support Pandas UDAF in
> PyFlink"
>
> Pandas UDF has been supported in FLINK 1.11 (FLIP-97[1]). It solves the
> high serialization/deserialization overhead in Python UDF and makes it
> convenient to leverage the popular Python libraries such as Pandas, Numpy,
> etc. Since Pandas UDF has so many advantages, we want to support Pandas
> UDAF to extend usage of Pandas UDF.
>
> Dian Fu and I have discussed offline and have drafted the FLIP-137[2]. It
> includes the following items:
>   - Support Pandas UDAF in Batch Group Aggregation
>   - Support Pandas UDAF in Batch Group Window Aggregation
>   - Support Pandas UDAF in Batch Over Window Aggregation
>   - Support Pandas UDAF in Stream Group Window Aggregation
>   - Support Pandas UDAF in Stream Bounded Over Window Aggregation
>
>
> Looking forward to your feedback!
>
> Best,
> Xingbo
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-97%3A+Support+Scalar+Vectorized+Python+UDF+in+PyFlink
> [2]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-137%3A+Support+Pandas+UDAF+in+PyFlink
>