You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@spark.apache.org by Takuya UESHIN <ue...@happy-camper.st> on 2017/09/01 06:01:26 UTC

[VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Hi all,

We've been discussing to support vectorized UDFs in Python and we almost
got a consensus about the APIs, so I'd like to summarize and call for a
vote.

Note that this vote should focus on APIs for vectorized UDFs, not APIs for
vectorized UDAFs or Window operations.

https://issues.apache.org/jira/browse/SPARK-21190


*Proposed API*

We introduce a @pandas_udf decorator (or annotation) to define vectorized
UDFs which takes one or more pandas.Series or one integer value meaning the
length of the input value for 0-parameter UDFs. The return value should be
pandas.Series of the specified type and the length of the returned value
should be the same as input value.

We can define vectorized UDFs as:

  @pandas_udf(DoubleType())
  def plus(v1, v2):
      return v1 + v2

or we can define as:

  plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())

We can use it similar to row-by-row UDFs:

  df.withColumn('sum', plus(df.v1, df.v2))

As for 0-parameter UDFs, we can define and use as:

  @pandas_udf(LongType())
  def f0(size):
      return pd.Series(1).repeat(size)

  df.select(f0())



The vote will be up for the next 72 hours. Please reply with your vote:

+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because of the following technical
reasons.

Thanks!

-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Posted by Takuya UESHIN <ue...@happy-camper.st>.

This vote passes with 4 binding +1 votes, 6 non-binding votes, no +0 vote,
and no -1 votes.

Thanks all!

+1 votes (binding):
Reynold Xin
Wenchen Fan
Yin Huai
Matei Zaharia


+1 votes (non-binding):
Felix Cheung
Bryan Cutler
Sameer Agarwal
Hyukjin Kwon
Xiao Li
Liang-Chi Hsieh



On Tue, Sep 12, 2017 at 11:46 AM, Liang-Chi Hsieh <vi...@gmail.com> wrote:

> +1
>
>
> Xiao Li wrote
> > +1
> >
> > Xiao
> > On Mon, 11 Sep 2017 at 6:44 PM Matei Zaharia &lt;
>
> > matei.zaharia@
>
> > &gt;
> > wrote:
> >
> >> +1 (binding)
> >>
> >> > On Sep 11, 2017, at 5:54 PM, Hyukjin Kwon &lt;
>
> > gurwls223@
>
> > &gt; wrote:
> >> >
> >> > +1 (non-binding)
> >> >
> >> >
> >> > 2017-09-12 9:52 GMT+09:00 Yin Huai &lt;
>
> > yhuai@
>
> > &gt;:
> >> > +1
> >> >
> >> > On Mon, Sep 11, 2017 at 5:47 PM, Sameer Agarwal &lt;
>
> > sameer@
>
> > &gt;
> >> wrote:
> >> > +1 (non-binding)
> >> >
> >> > On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler &lt;
>
> > cutlerb@
>
> > &gt; wrote:
> >> > +1 (non-binding) for the goals and non-goals of this SPIP.  I think
> >> it's
> >> fine to work out the minor details of the API during review.
> >> >
> >> > Bryan
> >> >
> >> > On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN &lt;
>
> > ueshin@
>
> > &gt;
> >> wrote:
> >> > Hi all,
> >> >
> >> > Thank you for voting and suggestions.
> >> >
> >> > As Wenchen mentioned and also we're discussing at JIRA, we need to
> >> discuss the size hint for the 0-parameter UDF.
> >> > But I believe we got a consensus about the basic APIs except for the
> >> size hint, I'd like to submit a pr based on the current proposal and
> >> continue discussing in its review.
> >> >
> >> >     https://github.com/apache/spark/pull/19147
> >> >
> >> > I'd keep this vote open to wait for more opinions.
> >> >
> >> > Thanks.
> >> >
> >> >
> >> > On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan &lt;
>
> > cloud0fan@
>
> > &gt; wrote:
> >> > +1 on the design and proposed API.
> >> >
> >> > One detail I'd like to discuss is the 0-parameter UDF, how we can
> >> specify the size hint. This can be done in the PR review though.
> >> >
> >> > On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung &lt;
>
> > felixcheung_m@
>
> > &gt;
> >> wrote:
> >> > +1 on this and like the suggestion of type in string form.
> >> >
> >> > Would it be correct to assume there will be data type check, for
> >> example
> >> the returned pandas data frame column data types match what are
> >> specified.
> >> We have seen quite a bit of issues/confusions with that in R.
> >> >
> >> > Would it make sense to have a more generic decorator name so that it
> >> could also be useable for other efficient vectorized format in the
> >> future?
> >> Or do we anticipate the decorator to be format specific and will have
> >> more
> >> in the future?
> >> >
> >> > From: Reynold Xin &lt;
>
> > rxin@
>
> > &gt;
> >> > Sent: Friday, September 1, 2017 5:16:11 AM
> >> > To: Takuya UESHIN
> >> > Cc: spark-dev
> >> > Subject: Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
> >> >
> >> > Ok, thanks.
> >> >
> >> > +1 on the SPIP for scope etc
> >> >
> >> >
> >> > On API details (will deal with in code reviews as well but leaving a
> >> note here in case I forget)
> >> >
> >> > 1. I would suggest having the API also accept data type specification
> >> in
> >> string form. It is usually simpler to say "long" then "LongType()".
> >> >
> >> > 2. Think about what error message to show when the rows numbers don't
> >> match at runtime.
> >> >
> >> >
> >> > On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN &lt;
>
> > ueshin@
>
> > &gt;
> >> wrote:
> >> > Yes, the aggregation is out of scope for now.
> >> > I think we should continue discussing the aggregation at JIRA and we
> >> will be adding those later separately.
> >> >
> >> > Thanks.
> >> >
> >> >
> >> > On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin &lt;
>
> > rxin@
>
> > &gt; wrote:
> >> > Is the idea aggregate is out of scope for the current effort and we
> >> will
> >> be adding those later?
> >> >
> >> > On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN &lt;
>
> > ueshin@
>
> > &gt;
> >> wrote:
> >> > Hi all,
> >> >
> >> > We've been discussing to support vectorized UDFs in Python and we
> >> almost
> >> got a consensus about the APIs, so I'd like to summarize and call for a
> >> vote.
> >> >
> >> > Note that this vote should focus on APIs for vectorized UDFs, not APIs
> >> for vectorized UDAFs or Window operations.
> >> >
> >> > https://issues.apache.org/jira/browse/SPARK-21190
> >> >
> >> >
> >> > Proposed API
> >> >
> >> > We introduce a @pandas_udf decorator (or annotation) to define
> >> vectorized UDFs which takes one or more pandas.Series or one integer
> >> value
> >> meaning the length of the input value for 0-parameter UDFs. The return
> >> value should be pandas.Series of the specified type and the length of
> the
> >> returned value should be the same as input value.
> >> >
> >> > We can define vectorized UDFs as:
> >> >
> >> >   @pandas_udf(DoubleType())
> >> >   def plus(v1, v2):
> >> >       return v1 + v2
> >> >
> >> > or we can define as:
> >> >
> >> >   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
> >> >
> >> > We can use it similar to row-by-row UDFs:
> >> >
> >> >   df.withColumn('sum', plus(df.v1, df.v2))
> >> >
> >> > As for 0-parameter UDFs, we can define and use as:
> >> >
> >> >   @pandas_udf(LongType())
> >> >   def f0(size):
> >> >       return pd.Series(1).repeat(size)
> >> >
> >> >   df.select(f0())
> >> >
> >> >
> >> >
> >> > The vote will be up for the next 72 hours. Please reply with your
> vote:
> >> >
> >> > +1: Yeah, let's go forward and implement the SPIP.
> >> > +0: Don't really care.
> >> > -1: I don't think this is a good idea because of the following
> >> technical
> >> reasons.
> >> >
> >> > Thanks!
> >> >
> >> > --
> >> > Takuya UESHIN
> >> > Tokyo, Japan
> >> >
> >> > http://twitter.com/ueshin
> >> >
> >> >
> >> >
> >> > --
> >> > Takuya UESHIN
> >> > Tokyo, Japan
> >> >
> >> > http://twitter.com/ueshin
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > Takuya UESHIN
> >> > Tokyo, Japan
> >> >
> >> > http://twitter.com/ueshin
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > Sameer Agarwal
> >> > Software Engineer | Databricks Inc.
> >> > http://cs.berkeley.edu/~sameerag
> >> >
> >> >
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe e-mail:
>
> > dev-unsubscribe@.apache
>
> >>
> >>
>
>
>
>
>
> -----
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>


-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Posted by Liang-Chi Hsieh <vi...@gmail.com>.

+1


Xiao Li wrote
> +1
> 
> Xiao
> On Mon, 11 Sep 2017 at 6:44 PM Matei Zaharia &lt;

> matei.zaharia@

> &gt;
> wrote:
> 
>> +1 (binding)
>>
>> > On Sep 11, 2017, at 5:54 PM, Hyukjin Kwon &lt;

> gurwls223@

> &gt; wrote:
>> >
>> > +1 (non-binding)
>> >
>> >
>> > 2017-09-12 9:52 GMT+09:00 Yin Huai &lt;

> yhuai@

> &gt;:
>> > +1
>> >
>> > On Mon, Sep 11, 2017 at 5:47 PM, Sameer Agarwal &lt;

> sameer@

> &gt;
>> wrote:
>> > +1 (non-binding)
>> >
>> > On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler &lt;

> cutlerb@

> &gt; wrote:
>> > +1 (non-binding) for the goals and non-goals of this SPIP.  I think
>> it's
>> fine to work out the minor details of the API during review.
>> >
>> > Bryan
>> >
>> > On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN &lt;

> ueshin@

> &gt;
>> wrote:
>> > Hi all,
>> >
>> > Thank you for voting and suggestions.
>> >
>> > As Wenchen mentioned and also we're discussing at JIRA, we need to
>> discuss the size hint for the 0-parameter UDF.
>> > But I believe we got a consensus about the basic APIs except for the
>> size hint, I'd like to submit a pr based on the current proposal and
>> continue discussing in its review.
>> >
>> >     https://github.com/apache/spark/pull/19147
>> >
>> > I'd keep this vote open to wait for more opinions.
>> >
>> > Thanks.
>> >
>> >
>> > On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan &lt;

> cloud0fan@

> &gt; wrote:
>> > +1 on the design and proposed API.
>> >
>> > One detail I'd like to discuss is the 0-parameter UDF, how we can
>> specify the size hint. This can be done in the PR review though.
>> >
>> > On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung &lt;

> felixcheung_m@

> &gt;
>> wrote:
>> > +1 on this and like the suggestion of type in string form.
>> >
>> > Would it be correct to assume there will be data type check, for
>> example
>> the returned pandas data frame column data types match what are
>> specified.
>> We have seen quite a bit of issues/confusions with that in R.
>> >
>> > Would it make sense to have a more generic decorator name so that it
>> could also be useable for other efficient vectorized format in the
>> future?
>> Or do we anticipate the decorator to be format specific and will have
>> more
>> in the future?
>> >
>> > From: Reynold Xin &lt;

> rxin@

> &gt;
>> > Sent: Friday, September 1, 2017 5:16:11 AM
>> > To: Takuya UESHIN
>> > Cc: spark-dev
>> > Subject: Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>> >
>> > Ok, thanks.
>> >
>> > +1 on the SPIP for scope etc
>> >
>> >
>> > On API details (will deal with in code reviews as well but leaving a
>> note here in case I forget)
>> >
>> > 1. I would suggest having the API also accept data type specification
>> in
>> string form. It is usually simpler to say "long" then "LongType()".
>> >
>> > 2. Think about what error message to show when the rows numbers don't
>> match at runtime.
>> >
>> >
>> > On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN &lt;

> ueshin@

> &gt;
>> wrote:
>> > Yes, the aggregation is out of scope for now.
>> > I think we should continue discussing the aggregation at JIRA and we
>> will be adding those later separately.
>> >
>> > Thanks.
>> >
>> >
>> > On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin &lt;

> rxin@

> &gt; wrote:
>> > Is the idea aggregate is out of scope for the current effort and we
>> will
>> be adding those later?
>> >
>> > On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN &lt;

> ueshin@

> &gt;
>> wrote:
>> > Hi all,
>> >
>> > We've been discussing to support vectorized UDFs in Python and we
>> almost
>> got a consensus about the APIs, so I'd like to summarize and call for a
>> vote.
>> >
>> > Note that this vote should focus on APIs for vectorized UDFs, not APIs
>> for vectorized UDAFs or Window operations.
>> >
>> > https://issues.apache.org/jira/browse/SPARK-21190
>> >
>> >
>> > Proposed API
>> >
>> > We introduce a @pandas_udf decorator (or annotation) to define
>> vectorized UDFs which takes one or more pandas.Series or one integer
>> value
>> meaning the length of the input value for 0-parameter UDFs. The return
>> value should be pandas.Series of the specified type and the length of the
>> returned value should be the same as input value.
>> >
>> > We can define vectorized UDFs as:
>> >
>> >   @pandas_udf(DoubleType())
>> >   def plus(v1, v2):
>> >       return v1 + v2
>> >
>> > or we can define as:
>> >
>> >   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>> >
>> > We can use it similar to row-by-row UDFs:
>> >
>> >   df.withColumn('sum', plus(df.v1, df.v2))
>> >
>> > As for 0-parameter UDFs, we can define and use as:
>> >
>> >   @pandas_udf(LongType())
>> >   def f0(size):
>> >       return pd.Series(1).repeat(size)
>> >
>> >   df.select(f0())
>> >
>> >
>> >
>> > The vote will be up for the next 72 hours. Please reply with your vote:
>> >
>> > +1: Yeah, let's go forward and implement the SPIP.
>> > +0: Don't really care.
>> > -1: I don't think this is a good idea because of the following
>> technical
>> reasons.
>> >
>> > Thanks!
>> >
>> > --
>> > Takuya UESHIN
>> > Tokyo, Japan
>> >
>> > http://twitter.com/ueshin
>> >
>> >
>> >
>> > --
>> > Takuya UESHIN
>> > Tokyo, Japan
>> >
>> > http://twitter.com/ueshin
>> >
>> >
>> >
>> >
>> > --
>> > Takuya UESHIN
>> > Tokyo, Japan
>> >
>> > http://twitter.com/ueshin
>> >
>> >
>> >
>> >
>> > --
>> > Sameer Agarwal
>> > Software Engineer | Databricks Inc.
>> > http://cs.berkeley.edu/~sameerag
>> >
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: 

> dev-unsubscribe@.apache

>>
>>





-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Posted by Liang-Chi Hsieh <vi...@gmail.com>.

+1


Xiao Li wrote
> +1
> 
> Xiao
> On Mon, 11 Sep 2017 at 6:44 PM Matei Zaharia &lt;

> matei.zaharia@

> &gt;
> wrote:
> 
>> +1 (binding)
>>
>> > On Sep 11, 2017, at 5:54 PM, Hyukjin Kwon &lt;

> gurwls223@

> &gt; wrote:
>> >
>> > +1 (non-binding)
>> >
>> >
>> > 2017-09-12 9:52 GMT+09:00 Yin Huai &lt;

> yhuai@

> &gt;:
>> > +1
>> >
>> > On Mon, Sep 11, 2017 at 5:47 PM, Sameer Agarwal &lt;

> sameer@

> &gt;
>> wrote:
>> > +1 (non-binding)
>> >
>> > On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler &lt;

> cutlerb@

> &gt; wrote:
>> > +1 (non-binding) for the goals and non-goals of this SPIP.  I think
>> it's
>> fine to work out the minor details of the API during review.
>> >
>> > Bryan
>> >
>> > On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN &lt;

> ueshin@

> &gt;
>> wrote:
>> > Hi all,
>> >
>> > Thank you for voting and suggestions.
>> >
>> > As Wenchen mentioned and also we're discussing at JIRA, we need to
>> discuss the size hint for the 0-parameter UDF.
>> > But I believe we got a consensus about the basic APIs except for the
>> size hint, I'd like to submit a pr based on the current proposal and
>> continue discussing in its review.
>> >
>> >     https://github.com/apache/spark/pull/19147
>> >
>> > I'd keep this vote open to wait for more opinions.
>> >
>> > Thanks.
>> >
>> >
>> > On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan &lt;

> cloud0fan@

> &gt; wrote:
>> > +1 on the design and proposed API.
>> >
>> > One detail I'd like to discuss is the 0-parameter UDF, how we can
>> specify the size hint. This can be done in the PR review though.
>> >
>> > On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung &lt;

> felixcheung_m@

> &gt;
>> wrote:
>> > +1 on this and like the suggestion of type in string form.
>> >
>> > Would it be correct to assume there will be data type check, for
>> example
>> the returned pandas data frame column data types match what are
>> specified.
>> We have seen quite a bit of issues/confusions with that in R.
>> >
>> > Would it make sense to have a more generic decorator name so that it
>> could also be useable for other efficient vectorized format in the
>> future?
>> Or do we anticipate the decorator to be format specific and will have
>> more
>> in the future?
>> >
>> > From: Reynold Xin &lt;

> rxin@

> &gt;
>> > Sent: Friday, September 1, 2017 5:16:11 AM
>> > To: Takuya UESHIN
>> > Cc: spark-dev
>> > Subject: Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>> >
>> > Ok, thanks.
>> >
>> > +1 on the SPIP for scope etc
>> >
>> >
>> > On API details (will deal with in code reviews as well but leaving a
>> note here in case I forget)
>> >
>> > 1. I would suggest having the API also accept data type specification
>> in
>> string form. It is usually simpler to say "long" then "LongType()".
>> >
>> > 2. Think about what error message to show when the rows numbers don't
>> match at runtime.
>> >
>> >
>> > On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN &lt;

> ueshin@

> &gt;
>> wrote:
>> > Yes, the aggregation is out of scope for now.
>> > I think we should continue discussing the aggregation at JIRA and we
>> will be adding those later separately.
>> >
>> > Thanks.
>> >
>> >
>> > On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin &lt;

> rxin@

> &gt; wrote:
>> > Is the idea aggregate is out of scope for the current effort and we
>> will
>> be adding those later?
>> >
>> > On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN &lt;

> ueshin@

> &gt;
>> wrote:
>> > Hi all,
>> >
>> > We've been discussing to support vectorized UDFs in Python and we
>> almost
>> got a consensus about the APIs, so I'd like to summarize and call for a
>> vote.
>> >
>> > Note that this vote should focus on APIs for vectorized UDFs, not APIs
>> for vectorized UDAFs or Window operations.
>> >
>> > https://issues.apache.org/jira/browse/SPARK-21190
>> >
>> >
>> > Proposed API
>> >
>> > We introduce a @pandas_udf decorator (or annotation) to define
>> vectorized UDFs which takes one or more pandas.Series or one integer
>> value
>> meaning the length of the input value for 0-parameter UDFs. The return
>> value should be pandas.Series of the specified type and the length of the
>> returned value should be the same as input value.
>> >
>> > We can define vectorized UDFs as:
>> >
>> >   @pandas_udf(DoubleType())
>> >   def plus(v1, v2):
>> >       return v1 + v2
>> >
>> > or we can define as:
>> >
>> >   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>> >
>> > We can use it similar to row-by-row UDFs:
>> >
>> >   df.withColumn('sum', plus(df.v1, df.v2))
>> >
>> > As for 0-parameter UDFs, we can define and use as:
>> >
>> >   @pandas_udf(LongType())
>> >   def f0(size):
>> >       return pd.Series(1).repeat(size)
>> >
>> >   df.select(f0())
>> >
>> >
>> >
>> > The vote will be up for the next 72 hours. Please reply with your vote:
>> >
>> > +1: Yeah, let's go forward and implement the SPIP.
>> > +0: Don't really care.
>> > -1: I don't think this is a good idea because of the following
>> technical
>> reasons.
>> >
>> > Thanks!
>> >
>> > --
>> > Takuya UESHIN
>> > Tokyo, Japan
>> >
>> > http://twitter.com/ueshin
>> >
>> >
>> >
>> > --
>> > Takuya UESHIN
>> > Tokyo, Japan
>> >
>> > http://twitter.com/ueshin
>> >
>> >
>> >
>> >
>> > --
>> > Takuya UESHIN
>> > Tokyo, Japan
>> >
>> > http://twitter.com/ueshin
>> >
>> >
>> >
>> >
>> > --
>> > Sameer Agarwal
>> > Software Engineer | Databricks Inc.
>> > http://cs.berkeley.edu/~sameerag
>> >
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: 

> dev-unsubscribe@.apache

>>
>>





-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Posted by Noman Khan <no...@live.com>.

+1(non-binding)

Regards
Noman
________________________________
From: Xiao Li <ga...@gmail.com>
Sent: Tuesday, September 12, 2017 2:44:26 AM
To: Matei Zaharia; Hyukjin Kwon
Cc: spark-dev
Subject: Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

+1

Xiao
On Mon, 11 Sep 2017 at 6:44 PM Matei Zaharia <ma...@gmail.com>> wrote:
+1 (binding)

> On Sep 11, 2017, at 5:54 PM, Hyukjin Kwon <gu...@gmail.com>> wrote:
>
> +1 (non-binding)
>
>
> 2017-09-12 9:52 GMT+09:00 Yin Huai <yh...@databricks.com>>:
> +1
>
> On Mon, Sep 11, 2017 at 5:47 PM, Sameer Agarwal <sa...@databricks.com>> wrote:
> +1 (non-binding)
>
> On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler <cu...@gmail.com>> wrote:
> +1 (non-binding) for the goals and non-goals of this SPIP.  I think it's fine to work out the minor details of the API during review.
>
> Bryan
>
> On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN <ue...@happy-camper.st>> wrote:
> Hi all,
>
> Thank you for voting and suggestions.
>
> As Wenchen mentioned and also we're discussing at JIRA, we need to discuss the size hint for the 0-parameter UDF.
> But I believe we got a consensus about the basic APIs except for the size hint, I'd like to submit a pr based on the current proposal and continue discussing in its review.
>
>     https://github.com/apache/spark/pull/19147
>
> I'd keep this vote open to wait for more opinions.
>
> Thanks.
>
>
> On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan <cl...@gmail.com>> wrote:
> +1 on the design and proposed API.
>
> One detail I'd like to discuss is the 0-parameter UDF, how we can specify the size hint. This can be done in the PR review though.
>
> On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <fe...@hotmail.com>> wrote:
> +1 on this and like the suggestion of type in string form.
>
> Would it be correct to assume there will be data type check, for example the returned pandas data frame column data types match what are specified. We have seen quite a bit of issues/confusions with that in R.
>
> Would it make sense to have a more generic decorator name so that it could also be useable for other efficient vectorized format in the future? Or do we anticipate the decorator to be format specific and will have more in the future?
>
> From: Reynold Xin <rx...@databricks.com>>
> Sent: Friday, September 1, 2017 5:16:11 AM
> To: Takuya UESHIN
> Cc: spark-dev
> Subject: Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>
> Ok, thanks.
>
> +1 on the SPIP for scope etc
>
>
> On API details (will deal with in code reviews as well but leaving a note here in case I forget)
>
> 1. I would suggest having the API also accept data type specification in string form. It is usually simpler to say "long" then "LongType()".
>
> 2. Think about what error message to show when the rows numbers don't match at runtime.
>
>
> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <ue...@happy-camper.st>> wrote:
> Yes, the aggregation is out of scope for now.
> I think we should continue discussing the aggregation at JIRA and we will be adding those later separately.
>
> Thanks.
>
>
> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <rx...@databricks.com>> wrote:
> Is the idea aggregate is out of scope for the current effort and we will be adding those later?
>
> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ue...@happy-camper.st>> wrote:
> Hi all,
>
> We've been discussing to support vectorized UDFs in Python and we almost got a consensus about the APIs, so I'd like to summarize and call for a vote.
>
> Note that this vote should focus on APIs for vectorized UDFs, not APIs for vectorized UDAFs or Window operations.
>
> https://issues.apache.org/jira/browse/SPARK-21190
>
>
> Proposed API
>
> We introduce a @pandas_udf decorator (or annotation) to define vectorized UDFs which takes one or more pandas.Series or one integer value meaning the length of the input value for 0-parameter UDFs. The return value should be pandas.Series of the specified type and the length of the returned value should be the same as input value.
>
> We can define vectorized UDFs as:
>
>   @pandas_udf(DoubleType())
>   def plus(v1, v2):
>       return v1 + v2
>
> or we can define as:
>
>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>
> We can use it similar to row-by-row UDFs:
>
>   df.withColumn('sum', plus(df.v1, df.v2))
>
> As for 0-parameter UDFs, we can define and use as:
>
>   @pandas_udf(LongType())
>   def f0(size):
>       return pd.Series(1).repeat(size)
>
>   df.select(f0())
>
>
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical reasons.
>
> Thanks!
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> http://twitter.com/ueshin
>
>
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> http://twitter.com/ueshin
>
>
>
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> http://twitter.com/ueshin
>
>
>
>
> --
> Sameer Agarwal
> Software Engineer | Databricks Inc.
> http://cs.berkeley.edu/~sameerag
>
>


---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org<ma...@spark.apache.org>

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Posted by Xiao Li <ga...@gmail.com>.

+1

Xiao
On Mon, 11 Sep 2017 at 6:44 PM Matei Zaharia <ma...@gmail.com>
wrote:

> +1 (binding)
>
> > On Sep 11, 2017, at 5:54 PM, Hyukjin Kwon <gu...@gmail.com> wrote:
> >
> > +1 (non-binding)
> >
> >
> > 2017-09-12 9:52 GMT+09:00 Yin Huai <yh...@databricks.com>:
> > +1
> >
> > On Mon, Sep 11, 2017 at 5:47 PM, Sameer Agarwal <sa...@databricks.com>
> wrote:
> > +1 (non-binding)
> >
> > On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler <cu...@gmail.com> wrote:
> > +1 (non-binding) for the goals and non-goals of this SPIP.  I think it's
> fine to work out the minor details of the API during review.
> >
> > Bryan
> >
> > On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN <ue...@happy-camper.st>
> wrote:
> > Hi all,
> >
> > Thank you for voting and suggestions.
> >
> > As Wenchen mentioned and also we're discussing at JIRA, we need to
> discuss the size hint for the 0-parameter UDF.
> > But I believe we got a consensus about the basic APIs except for the
> size hint, I'd like to submit a pr based on the current proposal and
> continue discussing in its review.
> >
> >     https://github.com/apache/spark/pull/19147
> >
> > I'd keep this vote open to wait for more opinions.
> >
> > Thanks.
> >
> >
> > On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan <cl...@gmail.com> wrote:
> > +1 on the design and proposed API.
> >
> > One detail I'd like to discuss is the 0-parameter UDF, how we can
> specify the size hint. This can be done in the PR review though.
> >
> > On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <fe...@hotmail.com>
> wrote:
> > +1 on this and like the suggestion of type in string form.
> >
> > Would it be correct to assume there will be data type check, for example
> the returned pandas data frame column data types match what are specified.
> We have seen quite a bit of issues/confusions with that in R.
> >
> > Would it make sense to have a more generic decorator name so that it
> could also be useable for other efficient vectorized format in the future?
> Or do we anticipate the decorator to be format specific and will have more
> in the future?
> >
> > From: Reynold Xin <rx...@databricks.com>
> > Sent: Friday, September 1, 2017 5:16:11 AM
> > To: Takuya UESHIN
> > Cc: spark-dev
> > Subject: Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
> >
> > Ok, thanks.
> >
> > +1 on the SPIP for scope etc
> >
> >
> > On API details (will deal with in code reviews as well but leaving a
> note here in case I forget)
> >
> > 1. I would suggest having the API also accept data type specification in
> string form. It is usually simpler to say "long" then "LongType()".
> >
> > 2. Think about what error message to show when the rows numbers don't
> match at runtime.
> >
> >
> > On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <ue...@happy-camper.st>
> wrote:
> > Yes, the aggregation is out of scope for now.
> > I think we should continue discussing the aggregation at JIRA and we
> will be adding those later separately.
> >
> > Thanks.
> >
> >
> > On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <rx...@databricks.com> wrote:
> > Is the idea aggregate is out of scope for the current effort and we will
> be adding those later?
> >
> > On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ue...@happy-camper.st>
> wrote:
> > Hi all,
> >
> > We've been discussing to support vectorized UDFs in Python and we almost
> got a consensus about the APIs, so I'd like to summarize and call for a
> vote.
> >
> > Note that this vote should focus on APIs for vectorized UDFs, not APIs
> for vectorized UDAFs or Window operations.
> >
> > https://issues.apache.org/jira/browse/SPARK-21190
> >
> >
> > Proposed API
> >
> > We introduce a @pandas_udf decorator (or annotation) to define
> vectorized UDFs which takes one or more pandas.Series or one integer value
> meaning the length of the input value for 0-parameter UDFs. The return
> value should be pandas.Series of the specified type and the length of the
> returned value should be the same as input value.
> >
> > We can define vectorized UDFs as:
> >
> >   @pandas_udf(DoubleType())
> >   def plus(v1, v2):
> >       return v1 + v2
> >
> > or we can define as:
> >
> >   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
> >
> > We can use it similar to row-by-row UDFs:
> >
> >   df.withColumn('sum', plus(df.v1, df.v2))
> >
> > As for 0-parameter UDFs, we can define and use as:
> >
> >   @pandas_udf(LongType())
> >   def f0(size):
> >       return pd.Series(1).repeat(size)
> >
> >   df.select(f0())
> >
> >
> >
> > The vote will be up for the next 72 hours. Please reply with your vote:
> >
> > +1: Yeah, let's go forward and implement the SPIP.
> > +0: Don't really care.
> > -1: I don't think this is a good idea because of the following technical
> reasons.
> >
> > Thanks!
> >
> > --
> > Takuya UESHIN
> > Tokyo, Japan
> >
> > http://twitter.com/ueshin
> >
> >
> >
> > --
> > Takuya UESHIN
> > Tokyo, Japan
> >
> > http://twitter.com/ueshin
> >
> >
> >
> >
> > --
> > Takuya UESHIN
> > Tokyo, Japan
> >
> > http://twitter.com/ueshin
> >
> >
> >
> >
> > --
> > Sameer Agarwal
> > Software Engineer | Databricks Inc.
> > http://cs.berkeley.edu/~sameerag
> >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Posted by Matei Zaharia <ma...@gmail.com>.

+1 (binding)

> On Sep 11, 2017, at 5:54 PM, Hyukjin Kwon <gu...@gmail.com> wrote:
> 
> +1 (non-binding)
> 
> 
> 2017-09-12 9:52 GMT+09:00 Yin Huai <yh...@databricks.com>:
> +1
> 
> On Mon, Sep 11, 2017 at 5:47 PM, Sameer Agarwal <sa...@databricks.com> wrote:
> +1 (non-binding)
> 
> On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler <cu...@gmail.com> wrote:
> +1 (non-binding) for the goals and non-goals of this SPIP.  I think it's fine to work out the minor details of the API during review.
> 
> Bryan
> 
> On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN <ue...@happy-camper.st> wrote:
> Hi all,
> 
> Thank you for voting and suggestions.
> 
> As Wenchen mentioned and also we're discussing at JIRA, we need to discuss the size hint for the 0-parameter UDF.
> But I believe we got a consensus about the basic APIs except for the size hint, I'd like to submit a pr based on the current proposal and continue discussing in its review.
> 
>     https://github.com/apache/spark/pull/19147
> 
> I'd keep this vote open to wait for more opinions.
> 
> Thanks.
> 
> 
> On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan <cl...@gmail.com> wrote:
> +1 on the design and proposed API.
> 
> One detail I'd like to discuss is the 0-parameter UDF, how we can specify the size hint. This can be done in the PR review though.
> 
> On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <fe...@hotmail.com> wrote:
> +1 on this and like the suggestion of type in string form.
> 
> Would it be correct to assume there will be data type check, for example the returned pandas data frame column data types match what are specified. We have seen quite a bit of issues/confusions with that in R.
> 
> Would it make sense to have a more generic decorator name so that it could also be useable for other efficient vectorized format in the future? Or do we anticipate the decorator to be format specific and will have more in the future?
> 
> From: Reynold Xin <rx...@databricks.com>
> Sent: Friday, September 1, 2017 5:16:11 AM
> To: Takuya UESHIN
> Cc: spark-dev
> Subject: Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>  
> Ok, thanks.
> 
> +1 on the SPIP for scope etc
> 
> 
> On API details (will deal with in code reviews as well but leaving a note here in case I forget)
> 
> 1. I would suggest having the API also accept data type specification in string form. It is usually simpler to say "long" then "LongType()". 
> 
> 2. Think about what error message to show when the rows numbers don't match at runtime. 
> 
> 
> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <ue...@happy-camper.st> wrote:
> Yes, the aggregation is out of scope for now.
> I think we should continue discussing the aggregation at JIRA and we will be adding those later separately.
> 
> Thanks.
> 
> 
> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <rx...@databricks.com> wrote:
> Is the idea aggregate is out of scope for the current effort and we will be adding those later?
> 
> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ue...@happy-camper.st> wrote:
> Hi all,
> 
> We've been discussing to support vectorized UDFs in Python and we almost got a consensus about the APIs, so I'd like to summarize and call for a vote.
> 
> Note that this vote should focus on APIs for vectorized UDFs, not APIs for vectorized UDAFs or Window operations.
> 
> https://issues.apache.org/jira/browse/SPARK-21190
> 
> 
> Proposed API
> 
> We introduce a @pandas_udf decorator (or annotation) to define vectorized UDFs which takes one or more pandas.Series or one integer value meaning the length of the input value for 0-parameter UDFs. The return value should be pandas.Series of the specified type and the length of the returned value should be the same as input value.
> 
> We can define vectorized UDFs as:
> 
>   @pandas_udf(DoubleType())
>   def plus(v1, v2):
>       return v1 + v2
> 
> or we can define as:
> 
>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
> 
> We can use it similar to row-by-row UDFs:
> 
>   df.withColumn('sum', plus(df.v1, df.v2))
> 
> As for 0-parameter UDFs, we can define and use as:
> 
>   @pandas_udf(LongType())
>   def f0(size):
>       return pd.Series(1).repeat(size)
> 
>   df.select(f0())
> 
> 
> 
> The vote will be up for the next 72 hours. Please reply with your vote:
> 
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical reasons.
> 
> Thanks!
> 
> -- 
> Takuya UESHIN
> Tokyo, Japan
> 
> http://twitter.com/ueshin
> 
> 
> 
> -- 
> Takuya UESHIN
> Tokyo, Japan
> 
> http://twitter.com/ueshin
> 
> 
> 
> 
> -- 
> Takuya UESHIN
> Tokyo, Japan
> 
> http://twitter.com/ueshin
> 
> 
> 
> 
> -- 
> Sameer Agarwal
> Software Engineer | Databricks Inc.
> http://cs.berkeley.edu/~sameerag
> 
> 


---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Posted by Hyukjin Kwon <gu...@gmail.com>.

+1 (non-binding)


2017-09-12 9:52 GMT+09:00 Yin Huai <yh...@databricks.com>:

> +1
>
> On Mon, Sep 11, 2017 at 5:47 PM, Sameer Agarwal <sa...@databricks.com>
> wrote:
>
>> +1 (non-binding)
>>
>> On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler <cu...@gmail.com> wrote:
>>
>>> +1 (non-binding) for the goals and non-goals of this SPIP.  I think it's
>>> fine to work out the minor details of the API during review.
>>>
>>> Bryan
>>>
>>> On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN <ue...@happy-camper.st>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Thank you for voting and suggestions.
>>>>
>>>> As Wenchen mentioned and also we're discussing at JIRA, we need to
>>>> discuss the size hint for the 0-parameter UDF.
>>>> But I believe we got a consensus about the basic APIs except for the
>>>> size hint, I'd like to submit a pr based on the current proposal and
>>>> continue discussing in its review.
>>>>
>>>>     https://github.com/apache/spark/pull/19147
>>>>
>>>> I'd keep this vote open to wait for more opinions.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan <cl...@gmail.com>
>>>> wrote:
>>>>
>>>>> +1 on the design and proposed API.
>>>>>
>>>>> One detail I'd like to discuss is the 0-parameter UDF, how we can
>>>>> specify the size hint. This can be done in the PR review though.
>>>>>
>>>>> On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <
>>>>> felixcheung_m@hotmail.com> wrote:
>>>>>
>>>>>> +1 on this and like the suggestion of type in string form.
>>>>>>
>>>>>> Would it be correct to assume there will be data type check, for
>>>>>> example the returned pandas data frame column data types match what are
>>>>>> specified. We have seen quite a bit of issues/confusions with that in R.
>>>>>>
>>>>>> Would it make sense to have a more generic decorator name so that it
>>>>>> could also be useable for other efficient vectorized format in the future?
>>>>>> Or do we anticipate the decorator to be format specific and will have more
>>>>>> in the future?
>>>>>>
>>>>>> ------------------------------
>>>>>> *From:* Reynold Xin <rx...@databricks.com>
>>>>>> *Sent:* Friday, September 1, 2017 5:16:11 AM
>>>>>> *To:* Takuya UESHIN
>>>>>> *Cc:* spark-dev
>>>>>> *Subject:* Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>>>>>>
>>>>>> Ok, thanks.
>>>>>>
>>>>>> +1 on the SPIP for scope etc
>>>>>>
>>>>>>
>>>>>> On API details (will deal with in code reviews as well but leaving a
>>>>>> note here in case I forget)
>>>>>>
>>>>>> 1. I would suggest having the API also accept data type specification
>>>>>> in string form. It is usually simpler to say "long" then "LongType()".
>>>>>>
>>>>>> 2. Think about what error message to show when the rows numbers don't
>>>>>> match at runtime.
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <ue...@happy-camper.st>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes, the aggregation is out of scope for now.
>>>>>>> I think we should continue discussing the aggregation at JIRA and we
>>>>>>> will be adding those later separately.
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <rx...@databricks.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Is the idea aggregate is out of scope for the current effort and we
>>>>>>>> will be adding those later?
>>>>>>>>
>>>>>>>> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <
>>>>>>>> ueshin@happy-camper.st> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> We've been discussing to support vectorized UDFs in Python and we
>>>>>>>>> almost got a consensus about the APIs, so I'd like to summarize
>>>>>>>>> and call for a vote.
>>>>>>>>>
>>>>>>>>> Note that this vote should focus on APIs for vectorized UDFs, not
>>>>>>>>> APIs for vectorized UDAFs or Window operations.
>>>>>>>>>
>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-21190
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *Proposed API*
>>>>>>>>>
>>>>>>>>> We introduce a @pandas_udf decorator (or annotation) to define
>>>>>>>>> vectorized UDFs which takes one or more pandas.Series or one
>>>>>>>>> integer value meaning the length of the input value for 0-parameter UDFs.
>>>>>>>>> The return value should be pandas.Series of the specified type
>>>>>>>>> and the length of the returned value should be the same as input value.
>>>>>>>>>
>>>>>>>>> We can define vectorized UDFs as:
>>>>>>>>>
>>>>>>>>>   @pandas_udf(DoubleType())
>>>>>>>>>   def plus(v1, v2):
>>>>>>>>>       return v1 + v2
>>>>>>>>>
>>>>>>>>> or we can define as:
>>>>>>>>>
>>>>>>>>>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>>>>>>>>>
>>>>>>>>> We can use it similar to row-by-row UDFs:
>>>>>>>>>
>>>>>>>>>   df.withColumn('sum', plus(df.v1, df.v2))
>>>>>>>>>
>>>>>>>>> As for 0-parameter UDFs, we can define and use as:
>>>>>>>>>
>>>>>>>>>   @pandas_udf(LongType())
>>>>>>>>>   def f0(size):
>>>>>>>>>       return pd.Series(1).repeat(size)
>>>>>>>>>
>>>>>>>>>   df.select(f0())
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The vote will be up for the next 72 hours. Please reply with your
>>>>>>>>> vote:
>>>>>>>>>
>>>>>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>>>>>> +0: Don't really care.
>>>>>>>>> -1: I don't think this is a good idea because of the following technical
>>>>>>>>> reasons.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Takuya UESHIN
>>>>>>>>> Tokyo, Japan
>>>>>>>>>
>>>>>>>>> http://twitter.com/ueshin
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Takuya UESHIN
>>>>>>> Tokyo, Japan
>>>>>>>
>>>>>>> http://twitter.com/ueshin
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Takuya UESHIN
>>>> Tokyo, Japan
>>>>
>>>> http://twitter.com/ueshin
>>>>
>>>
>>>
>>
>>
>> --
>> Sameer Agarwal
>> Software Engineer | Databricks Inc.
>> http://cs.berkeley.edu/~sameerag
>>
>
>

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Posted by Yin Huai <yh...@databricks.com>.

+1

On Mon, Sep 11, 2017 at 5:47 PM, Sameer Agarwal <sa...@databricks.com>
wrote:

> +1 (non-binding)
>
> On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler <cu...@gmail.com> wrote:
>
>> +1 (non-binding) for the goals and non-goals of this SPIP.  I think it's
>> fine to work out the minor details of the API during review.
>>
>> Bryan
>>
>> On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN <ue...@happy-camper.st>
>> wrote:
>>
>>> Hi all,
>>>
>>> Thank you for voting and suggestions.
>>>
>>> As Wenchen mentioned and also we're discussing at JIRA, we need to
>>> discuss the size hint for the 0-parameter UDF.
>>> But I believe we got a consensus about the basic APIs except for the
>>> size hint, I'd like to submit a pr based on the current proposal and
>>> continue discussing in its review.
>>>
>>>     https://github.com/apache/spark/pull/19147
>>>
>>> I'd keep this vote open to wait for more opinions.
>>>
>>> Thanks.
>>>
>>>
>>> On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan <cl...@gmail.com> wrote:
>>>
>>>> +1 on the design and proposed API.
>>>>
>>>> One detail I'd like to discuss is the 0-parameter UDF, how we can
>>>> specify the size hint. This can be done in the PR review though.
>>>>
>>>> On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <felixcheung_m@hotmail.com
>>>> > wrote:
>>>>
>>>>> +1 on this and like the suggestion of type in string form.
>>>>>
>>>>> Would it be correct to assume there will be data type check, for
>>>>> example the returned pandas data frame column data types match what are
>>>>> specified. We have seen quite a bit of issues/confusions with that in R.
>>>>>
>>>>> Would it make sense to have a more generic decorator name so that it
>>>>> could also be useable for other efficient vectorized format in the future?
>>>>> Or do we anticipate the decorator to be format specific and will have more
>>>>> in the future?
>>>>>
>>>>> ------------------------------
>>>>> *From:* Reynold Xin <rx...@databricks.com>
>>>>> *Sent:* Friday, September 1, 2017 5:16:11 AM
>>>>> *To:* Takuya UESHIN
>>>>> *Cc:* spark-dev
>>>>> *Subject:* Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>>>>>
>>>>> Ok, thanks.
>>>>>
>>>>> +1 on the SPIP for scope etc
>>>>>
>>>>>
>>>>> On API details (will deal with in code reviews as well but leaving a
>>>>> note here in case I forget)
>>>>>
>>>>> 1. I would suggest having the API also accept data type specification
>>>>> in string form. It is usually simpler to say "long" then "LongType()".
>>>>>
>>>>> 2. Think about what error message to show when the rows numbers don't
>>>>> match at runtime.
>>>>>
>>>>>
>>>>> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <ue...@happy-camper.st>
>>>>> wrote:
>>>>>
>>>>>> Yes, the aggregation is out of scope for now.
>>>>>> I think we should continue discussing the aggregation at JIRA and we
>>>>>> will be adding those later separately.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <rx...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Is the idea aggregate is out of scope for the current effort and we
>>>>>>> will be adding those later?
>>>>>>>
>>>>>>> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ue...@happy-camper.st>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> We've been discussing to support vectorized UDFs in Python and we
>>>>>>>> almost got a consensus about the APIs, so I'd like to summarize
>>>>>>>> and call for a vote.
>>>>>>>>
>>>>>>>> Note that this vote should focus on APIs for vectorized UDFs, not
>>>>>>>> APIs for vectorized UDAFs or Window operations.
>>>>>>>>
>>>>>>>> https://issues.apache.org/jira/browse/SPARK-21190
>>>>>>>>
>>>>>>>>
>>>>>>>> *Proposed API*
>>>>>>>>
>>>>>>>> We introduce a @pandas_udf decorator (or annotation) to define
>>>>>>>> vectorized UDFs which takes one or more pandas.Series or one
>>>>>>>> integer value meaning the length of the input value for 0-parameter UDFs.
>>>>>>>> The return value should be pandas.Series of the specified type and
>>>>>>>> the length of the returned value should be the same as input value.
>>>>>>>>
>>>>>>>> We can define vectorized UDFs as:
>>>>>>>>
>>>>>>>>   @pandas_udf(DoubleType())
>>>>>>>>   def plus(v1, v2):
>>>>>>>>       return v1 + v2
>>>>>>>>
>>>>>>>> or we can define as:
>>>>>>>>
>>>>>>>>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>>>>>>>>
>>>>>>>> We can use it similar to row-by-row UDFs:
>>>>>>>>
>>>>>>>>   df.withColumn('sum', plus(df.v1, df.v2))
>>>>>>>>
>>>>>>>> As for 0-parameter UDFs, we can define and use as:
>>>>>>>>
>>>>>>>>   @pandas_udf(LongType())
>>>>>>>>   def f0(size):
>>>>>>>>       return pd.Series(1).repeat(size)
>>>>>>>>
>>>>>>>>   df.select(f0())
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> The vote will be up for the next 72 hours. Please reply with your
>>>>>>>> vote:
>>>>>>>>
>>>>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>>>>> +0: Don't really care.
>>>>>>>> -1: I don't think this is a good idea because of the following technical
>>>>>>>> reasons.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> --
>>>>>>>> Takuya UESHIN
>>>>>>>> Tokyo, Japan
>>>>>>>>
>>>>>>>> http://twitter.com/ueshin
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Takuya UESHIN
>>>>>> Tokyo, Japan
>>>>>>
>>>>>> http://twitter.com/ueshin
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Takuya UESHIN
>>> Tokyo, Japan
>>>
>>> http://twitter.com/ueshin
>>>
>>
>>
>
>
> --
> Sameer Agarwal
> Software Engineer | Databricks Inc.
> http://cs.berkeley.edu/~sameerag
>

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Posted by Sameer Agarwal <sa...@databricks.com>.

+1 (non-binding)

On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler <cu...@gmail.com> wrote:

> +1 (non-binding) for the goals and non-goals of this SPIP.  I think it's
> fine to work out the minor details of the API during review.
>
> Bryan
>
> On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN <ue...@happy-camper.st>
> wrote:
>
>> Hi all,
>>
>> Thank you for voting and suggestions.
>>
>> As Wenchen mentioned and also we're discussing at JIRA, we need to
>> discuss the size hint for the 0-parameter UDF.
>> But I believe we got a consensus about the basic APIs except for the size
>> hint, I'd like to submit a pr based on the current proposal and continue
>> discussing in its review.
>>
>>     https://github.com/apache/spark/pull/19147
>>
>> I'd keep this vote open to wait for more opinions.
>>
>> Thanks.
>>
>>
>> On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan <cl...@gmail.com> wrote:
>>
>>> +1 on the design and proposed API.
>>>
>>> One detail I'd like to discuss is the 0-parameter UDF, how we can
>>> specify the size hint. This can be done in the PR review though.
>>>
>>> On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <fe...@hotmail.com>
>>> wrote:
>>>
>>>> +1 on this and like the suggestion of type in string form.
>>>>
>>>> Would it be correct to assume there will be data type check, for
>>>> example the returned pandas data frame column data types match what are
>>>> specified. We have seen quite a bit of issues/confusions with that in R.
>>>>
>>>> Would it make sense to have a more generic decorator name so that it
>>>> could also be useable for other efficient vectorized format in the future?
>>>> Or do we anticipate the decorator to be format specific and will have more
>>>> in the future?
>>>>
>>>> ------------------------------
>>>> *From:* Reynold Xin <rx...@databricks.com>
>>>> *Sent:* Friday, September 1, 2017 5:16:11 AM
>>>> *To:* Takuya UESHIN
>>>> *Cc:* spark-dev
>>>> *Subject:* Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>>>>
>>>> Ok, thanks.
>>>>
>>>> +1 on the SPIP for scope etc
>>>>
>>>>
>>>> On API details (will deal with in code reviews as well but leaving a
>>>> note here in case I forget)
>>>>
>>>> 1. I would suggest having the API also accept data type specification
>>>> in string form. It is usually simpler to say "long" then "LongType()".
>>>>
>>>> 2. Think about what error message to show when the rows numbers don't
>>>> match at runtime.
>>>>
>>>>
>>>> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <ue...@happy-camper.st>
>>>> wrote:
>>>>
>>>>> Yes, the aggregation is out of scope for now.
>>>>> I think we should continue discussing the aggregation at JIRA and we
>>>>> will be adding those later separately.
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <rx...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> Is the idea aggregate is out of scope for the current effort and we
>>>>>> will be adding those later?
>>>>>>
>>>>>> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ue...@happy-camper.st>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> We've been discussing to support vectorized UDFs in Python and we
>>>>>>> almost got a consensus about the APIs, so I'd like to summarize and
>>>>>>> call for a vote.
>>>>>>>
>>>>>>> Note that this vote should focus on APIs for vectorized UDFs, not
>>>>>>> APIs for vectorized UDAFs or Window operations.
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/SPARK-21190
>>>>>>>
>>>>>>>
>>>>>>> *Proposed API*
>>>>>>>
>>>>>>> We introduce a @pandas_udf decorator (or annotation) to define
>>>>>>> vectorized UDFs which takes one or more pandas.Series or one
>>>>>>> integer value meaning the length of the input value for 0-parameter UDFs.
>>>>>>> The return value should be pandas.Series of the specified type and
>>>>>>> the length of the returned value should be the same as input value.
>>>>>>>
>>>>>>> We can define vectorized UDFs as:
>>>>>>>
>>>>>>>   @pandas_udf(DoubleType())
>>>>>>>   def plus(v1, v2):
>>>>>>>       return v1 + v2
>>>>>>>
>>>>>>> or we can define as:
>>>>>>>
>>>>>>>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>>>>>>>
>>>>>>> We can use it similar to row-by-row UDFs:
>>>>>>>
>>>>>>>   df.withColumn('sum', plus(df.v1, df.v2))
>>>>>>>
>>>>>>> As for 0-parameter UDFs, we can define and use as:
>>>>>>>
>>>>>>>   @pandas_udf(LongType())
>>>>>>>   def f0(size):
>>>>>>>       return pd.Series(1).repeat(size)
>>>>>>>
>>>>>>>   df.select(f0())
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The vote will be up for the next 72 hours. Please reply with your
>>>>>>> vote:
>>>>>>>
>>>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>>>> +0: Don't really care.
>>>>>>> -1: I don't think this is a good idea because of the following technical
>>>>>>> reasons.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> --
>>>>>>> Takuya UESHIN
>>>>>>> Tokyo, Japan
>>>>>>>
>>>>>>> http://twitter.com/ueshin
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Takuya UESHIN
>>>>> Tokyo, Japan
>>>>>
>>>>> http://twitter.com/ueshin
>>>>>
>>>>
>>>
>>
>>
>> --
>> Takuya UESHIN
>> Tokyo, Japan
>>
>> http://twitter.com/ueshin
>>
>
>


-- 
Sameer Agarwal
Software Engineer | Databricks Inc.
http://cs.berkeley.edu/~sameerag

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Posted by Bryan Cutler <cu...@gmail.com>.

+1 (non-binding) for the goals and non-goals of this SPIP.  I think it's
fine to work out the minor details of the API during review.

Bryan

On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN <ue...@happy-camper.st>
wrote:

> Hi all,
>
> Thank you for voting and suggestions.
>
> As Wenchen mentioned and also we're discussing at JIRA, we need to discuss
> the size hint for the 0-parameter UDF.
> But I believe we got a consensus about the basic APIs except for the size
> hint, I'd like to submit a pr based on the current proposal and continue
> discussing in its review.
>
>     https://github.com/apache/spark/pull/19147
>
> I'd keep this vote open to wait for more opinions.
>
> Thanks.
>
>
> On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan <cl...@gmail.com> wrote:
>
>> +1 on the design and proposed API.
>>
>> One detail I'd like to discuss is the 0-parameter UDF, how we can specify
>> the size hint. This can be done in the PR review though.
>>
>> On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <fe...@hotmail.com>
>> wrote:
>>
>>> +1 on this and like the suggestion of type in string form.
>>>
>>> Would it be correct to assume there will be data type check, for example
>>> the returned pandas data frame column data types match what are specified.
>>> We have seen quite a bit of issues/confusions with that in R.
>>>
>>> Would it make sense to have a more generic decorator name so that it
>>> could also be useable for other efficient vectorized format in the future?
>>> Or do we anticipate the decorator to be format specific and will have more
>>> in the future?
>>>
>>> ------------------------------
>>> *From:* Reynold Xin <rx...@databricks.com>
>>> *Sent:* Friday, September 1, 2017 5:16:11 AM
>>> *To:* Takuya UESHIN
>>> *Cc:* spark-dev
>>> *Subject:* Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>>>
>>> Ok, thanks.
>>>
>>> +1 on the SPIP for scope etc
>>>
>>>
>>> On API details (will deal with in code reviews as well but leaving a
>>> note here in case I forget)
>>>
>>> 1. I would suggest having the API also accept data type specification in
>>> string form. It is usually simpler to say "long" then "LongType()".
>>>
>>> 2. Think about what error message to show when the rows numbers don't
>>> match at runtime.
>>>
>>>
>>> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <ue...@happy-camper.st>
>>> wrote:
>>>
>>>> Yes, the aggregation is out of scope for now.
>>>> I think we should continue discussing the aggregation at JIRA and we
>>>> will be adding those later separately.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <rx...@databricks.com>
>>>> wrote:
>>>>
>>>>> Is the idea aggregate is out of scope for the current effort and we
>>>>> will be adding those later?
>>>>>
>>>>> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ue...@happy-camper.st>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We've been discussing to support vectorized UDFs in Python and we
>>>>>> almost got a consensus about the APIs, so I'd like to summarize and
>>>>>> call for a vote.
>>>>>>
>>>>>> Note that this vote should focus on APIs for vectorized UDFs, not
>>>>>> APIs for vectorized UDAFs or Window operations.
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/SPARK-21190
>>>>>>
>>>>>>
>>>>>> *Proposed API*
>>>>>>
>>>>>> We introduce a @pandas_udf decorator (or annotation) to define
>>>>>> vectorized UDFs which takes one or more pandas.Series or one integer
>>>>>> value meaning the length of the input value for 0-parameter UDFs. The
>>>>>> return value should be pandas.Series of the specified type and the
>>>>>> length of the returned value should be the same as input value.
>>>>>>
>>>>>> We can define vectorized UDFs as:
>>>>>>
>>>>>>   @pandas_udf(DoubleType())
>>>>>>   def plus(v1, v2):
>>>>>>       return v1 + v2
>>>>>>
>>>>>> or we can define as:
>>>>>>
>>>>>>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>>>>>>
>>>>>> We can use it similar to row-by-row UDFs:
>>>>>>
>>>>>>   df.withColumn('sum', plus(df.v1, df.v2))
>>>>>>
>>>>>> As for 0-parameter UDFs, we can define and use as:
>>>>>>
>>>>>>   @pandas_udf(LongType())
>>>>>>   def f0(size):
>>>>>>       return pd.Series(1).repeat(size)
>>>>>>
>>>>>>   df.select(f0())
>>>>>>
>>>>>>
>>>>>>
>>>>>> The vote will be up for the next 72 hours. Please reply with your
>>>>>> vote:
>>>>>>
>>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>>> +0: Don't really care.
>>>>>> -1: I don't think this is a good idea because of the following technical
>>>>>> reasons.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> --
>>>>>> Takuya UESHIN
>>>>>> Tokyo, Japan
>>>>>>
>>>>>> http://twitter.com/ueshin
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Takuya UESHIN
>>>> Tokyo, Japan
>>>>
>>>> http://twitter.com/ueshin
>>>>
>>>
>>
>
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> http://twitter.com/ueshin
>

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Posted by Takuya UESHIN <ue...@happy-camper.st>.

Hi all,

Thank you for voting and suggestions.

As Wenchen mentioned and also we're discussing at JIRA, we need to discuss
the size hint for the 0-parameter UDF.
But I believe we got a consensus about the basic APIs except for the size
hint, I'd like to submit a pr based on the current proposal and continue
discussing in its review.

    https://github.com/apache/spark/pull/19147

I'd keep this vote open to wait for more opinions.

Thanks.


On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan <cl...@gmail.com> wrote:

> +1 on the design and proposed API.
>
> One detail I'd like to discuss is the 0-parameter UDF, how we can specify
> the size hint. This can be done in the PR review though.
>
> On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <fe...@hotmail.com>
> wrote:
>
>> +1 on this and like the suggestion of type in string form.
>>
>> Would it be correct to assume there will be data type check, for example
>> the returned pandas data frame column data types match what are specified.
>> We have seen quite a bit of issues/confusions with that in R.
>>
>> Would it make sense to have a more generic decorator name so that it
>> could also be useable for other efficient vectorized format in the future?
>> Or do we anticipate the decorator to be format specific and will have more
>> in the future?
>>
>> ------------------------------
>> *From:* Reynold Xin <rx...@databricks.com>
>> *Sent:* Friday, September 1, 2017 5:16:11 AM
>> *To:* Takuya UESHIN
>> *Cc:* spark-dev
>> *Subject:* Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>>
>> Ok, thanks.
>>
>> +1 on the SPIP for scope etc
>>
>>
>> On API details (will deal with in code reviews as well but leaving a note
>> here in case I forget)
>>
>> 1. I would suggest having the API also accept data type specification in
>> string form. It is usually simpler to say "long" then "LongType()".
>>
>> 2. Think about what error message to show when the rows numbers don't
>> match at runtime.
>>
>>
>> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <ue...@happy-camper.st>
>> wrote:
>>
>>> Yes, the aggregation is out of scope for now.
>>> I think we should continue discussing the aggregation at JIRA and we
>>> will be adding those later separately.
>>>
>>> Thanks.
>>>
>>>
>>> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <rx...@databricks.com> wrote:
>>>
>>>> Is the idea aggregate is out of scope for the current effort and we
>>>> will be adding those later?
>>>>
>>>> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ue...@happy-camper.st>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> We've been discussing to support vectorized UDFs in Python and we
>>>>> almost got a consensus about the APIs, so I'd like to summarize and
>>>>> call for a vote.
>>>>>
>>>>> Note that this vote should focus on APIs for vectorized UDFs, not APIs
>>>>> for vectorized UDAFs or Window operations.
>>>>>
>>>>> https://issues.apache.org/jira/browse/SPARK-21190
>>>>>
>>>>>
>>>>> *Proposed API*
>>>>>
>>>>> We introduce a @pandas_udf decorator (or annotation) to define
>>>>> vectorized UDFs which takes one or more pandas.Series or one integer
>>>>> value meaning the length of the input value for 0-parameter UDFs. The
>>>>> return value should be pandas.Series of the specified type and the
>>>>> length of the returned value should be the same as input value.
>>>>>
>>>>> We can define vectorized UDFs as:
>>>>>
>>>>>   @pandas_udf(DoubleType())
>>>>>   def plus(v1, v2):
>>>>>       return v1 + v2
>>>>>
>>>>> or we can define as:
>>>>>
>>>>>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>>>>>
>>>>> We can use it similar to row-by-row UDFs:
>>>>>
>>>>>   df.withColumn('sum', plus(df.v1, df.v2))
>>>>>
>>>>> As for 0-parameter UDFs, we can define and use as:
>>>>>
>>>>>   @pandas_udf(LongType())
>>>>>   def f0(size):
>>>>>       return pd.Series(1).repeat(size)
>>>>>
>>>>>   df.select(f0())
>>>>>
>>>>>
>>>>>
>>>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>>>
>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>> +0: Don't really care.
>>>>> -1: I don't think this is a good idea because of the following technical
>>>>> reasons.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> --
>>>>> Takuya UESHIN
>>>>> Tokyo, Japan
>>>>>
>>>>> http://twitter.com/ueshin
>>>>>
>>>>
>>>
>>>
>>> --
>>> Takuya UESHIN
>>> Tokyo, Japan
>>>
>>> http://twitter.com/ueshin
>>>
>>
>


-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Posted by Wenchen Fan <cl...@gmail.com>.

+1 on the design and proposed API.

One detail I'd like to discuss is the 0-parameter UDF, how we can specify
the size hint. This can be done in the PR review though.

On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <fe...@hotmail.com>
wrote:

> +1 on this and like the suggestion of type in string form.
>
> Would it be correct to assume there will be data type check, for example
> the returned pandas data frame column data types match what are specified.
> We have seen quite a bit of issues/confusions with that in R.
>
> Would it make sense to have a more generic decorator name so that it could
> also be useable for other efficient vectorized format in the future? Or do
> we anticipate the decorator to be format specific and will have more in the
> future?
>
> ------------------------------
> *From:* Reynold Xin <rx...@databricks.com>
> *Sent:* Friday, September 1, 2017 5:16:11 AM
> *To:* Takuya UESHIN
> *Cc:* spark-dev
> *Subject:* Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>
> Ok, thanks.
>
> +1 on the SPIP for scope etc
>
>
> On API details (will deal with in code reviews as well but leaving a note
> here in case I forget)
>
> 1. I would suggest having the API also accept data type specification in
> string form. It is usually simpler to say "long" then "LongType()".
>
> 2. Think about what error message to show when the rows numbers don't
> match at runtime.
>
>
> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <ue...@happy-camper.st>
> wrote:
>
>> Yes, the aggregation is out of scope for now.
>> I think we should continue discussing the aggregation at JIRA and we will
>> be adding those later separately.
>>
>> Thanks.
>>
>>
>> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <rx...@databricks.com> wrote:
>>
>>> Is the idea aggregate is out of scope for the current effort and we will
>>> be adding those later?
>>>
>>> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ue...@happy-camper.st>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> We've been discussing to support vectorized UDFs in Python and we
>>>> almost got a consensus about the APIs, so I'd like to summarize and
>>>> call for a vote.
>>>>
>>>> Note that this vote should focus on APIs for vectorized UDFs, not APIs
>>>> for vectorized UDAFs or Window operations.
>>>>
>>>> https://issues.apache.org/jira/browse/SPARK-21190
>>>>
>>>>
>>>> *Proposed API*
>>>>
>>>> We introduce a @pandas_udf decorator (or annotation) to define
>>>> vectorized UDFs which takes one or more pandas.Series or one integer
>>>> value meaning the length of the input value for 0-parameter UDFs. The
>>>> return value should be pandas.Series of the specified type and the
>>>> length of the returned value should be the same as input value.
>>>>
>>>> We can define vectorized UDFs as:
>>>>
>>>>   @pandas_udf(DoubleType())
>>>>   def plus(v1, v2):
>>>>       return v1 + v2
>>>>
>>>> or we can define as:
>>>>
>>>>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>>>>
>>>> We can use it similar to row-by-row UDFs:
>>>>
>>>>   df.withColumn('sum', plus(df.v1, df.v2))
>>>>
>>>> As for 0-parameter UDFs, we can define and use as:
>>>>
>>>>   @pandas_udf(LongType())
>>>>   def f0(size):
>>>>       return pd.Series(1).repeat(size)
>>>>
>>>>   df.select(f0())
>>>>
>>>>
>>>>
>>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>>
>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>> +0: Don't really care.
>>>> -1: I don't think this is a good idea because of the following technical
>>>> reasons.
>>>>
>>>> Thanks!
>>>>
>>>> --
>>>> Takuya UESHIN
>>>> Tokyo, Japan
>>>>
>>>> http://twitter.com/ueshin
>>>>
>>>
>>
>>
>> --
>> Takuya UESHIN
>> Tokyo, Japan
>>
>> http://twitter.com/ueshin
>>
>

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Posted by Felix Cheung <fe...@hotmail.com>.

+1 on this and like the suggestion of type in string form.

Would it be correct to assume there will be data type check, for example the returned pandas data frame column data types match what are specified. We have seen quite a bit of issues/confusions with that in R.

Would it make sense to have a more generic decorator name so that it could also be useable for other efficient vectorized format in the future? Or do we anticipate the decorator to be format specific and will have more in the future?

________________________________
From: Reynold Xin <rx...@databricks.com>
Sent: Friday, September 1, 2017 5:16:11 AM
To: Takuya UESHIN
Cc: spark-dev
Subject: Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Ok, thanks.

+1 on the SPIP for scope etc

On API details (will deal with in code reviews as well but leaving a note here in case I forget)

1. I would suggest having the API also accept data type specification in string form. It is usually simpler to say "long" then "LongType()".

2. Think about what error message to show when the rows numbers don't match at runtime.

On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <ue...@happy-camper.st>> wrote:
Yes, the aggregation is out of scope for now.
I think we should continue discussing the aggregation at JIRA and we will be adding those later separately.

Thanks.

On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <rx...@databricks.com>> wrote:
Is the idea aggregate is out of scope for the current effort and we will be adding those later?

On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ue...@happy-camper.st>> wrote:
Hi all,

We've been discussing to support vectorized UDFs in Python and we almost got a consensus about the APIs, so I'd like to summarize and call for a vote.

Note that this vote should focus on APIs for vectorized UDFs, not APIs for vectorized UDAFs or Window operations.

https://issues.apache.org/jira/browse/SPARK-21190

Proposed API

We introduce a @pandas_udf decorator (or annotation) to define vectorized UDFs which takes one or more pandas.Series or one integer value meaning the length of the input value for 0-parameter UDFs. The return value should be pandas.Series of the specified type and the length of the returned value should be the same as input value.

We can define vectorized UDFs as:

  @pandas_udf(DoubleType())
  def plus(v1, v2):
      return v1 + v2

or we can define as:

  plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())

We can use it similar to row-by-row UDFs:

  df.withColumn('sum', plus(df.v1, df.v2))

As for 0-parameter UDFs, we can define and use as:

  @pandas_udf(LongType())
  def f0(size):
      return pd.Series(1).repeat(size)

  df.select(f0())

The vote will be up for the next 72 hours. Please reply with your vote:

+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because of the following technical reasons.

Thanks!

--
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin

--
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Posted by Reynold Xin <rx...@databricks.com>.

Ok, thanks.

+1 on the SPIP for scope etc


On API details (will deal with in code reviews as well but leaving a note
here in case I forget)

1. I would suggest having the API also accept data type specification in
string form. It is usually simpler to say "long" then "LongType()".

2. Think about what error message to show when the rows numbers don't match
at runtime.


On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <ue...@happy-camper.st>
wrote:

> Yes, the aggregation is out of scope for now.
> I think we should continue discussing the aggregation at JIRA and we will
> be adding those later separately.
>
> Thanks.
>
>
> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <rx...@databricks.com> wrote:
>
>> Is the idea aggregate is out of scope for the current effort and we will
>> be adding those later?
>>
>> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ue...@happy-camper.st>
>> wrote:
>>
>>> Hi all,
>>>
>>> We've been discussing to support vectorized UDFs in Python and we almost
>>> got a consensus about the APIs, so I'd like to summarize and call for a
>>> vote.
>>>
>>> Note that this vote should focus on APIs for vectorized UDFs, not APIs
>>> for vectorized UDAFs or Window operations.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-21190
>>>
>>>
>>> *Proposed API*
>>>
>>> We introduce a @pandas_udf decorator (or annotation) to define
>>> vectorized UDFs which takes one or more pandas.Series or one integer
>>> value meaning the length of the input value for 0-parameter UDFs. The
>>> return value should be pandas.Series of the specified type and the
>>> length of the returned value should be the same as input value.
>>>
>>> We can define vectorized UDFs as:
>>>
>>>   @pandas_udf(DoubleType())
>>>   def plus(v1, v2):
>>>       return v1 + v2
>>>
>>> or we can define as:
>>>
>>>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>>>
>>> We can use it similar to row-by-row UDFs:
>>>
>>>   df.withColumn('sum', plus(df.v1, df.v2))
>>>
>>> As for 0-parameter UDFs, we can define and use as:
>>>
>>>   @pandas_udf(LongType())
>>>   def f0(size):
>>>       return pd.Series(1).repeat(size)
>>>
>>>   df.select(f0())
>>>
>>>
>>>
>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>
>>> +1: Yeah, let's go forward and implement the SPIP.
>>> +0: Don't really care.
>>> -1: I don't think this is a good idea because of the following technical
>>> reasons.
>>>
>>> Thanks!
>>>
>>> --
>>> Takuya UESHIN
>>> Tokyo, Japan
>>>
>>> http://twitter.com/ueshin
>>>
>>
>
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> http://twitter.com/ueshin
>

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Posted by Takuya UESHIN <ue...@happy-camper.st>.

Yes, the aggregation is out of scope for now.
I think we should continue discussing the aggregation at JIRA and we will
be adding those later separately.

Thanks.


On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <rx...@databricks.com> wrote:

> Is the idea aggregate is out of scope for the current effort and we will
> be adding those later?
>
> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ue...@happy-camper.st>
> wrote:
>
>> Hi all,
>>
>> We've been discussing to support vectorized UDFs in Python and we almost
>> got a consensus about the APIs, so I'd like to summarize and call for a
>> vote.
>>
>> Note that this vote should focus on APIs for vectorized UDFs, not APIs
>> for vectorized UDAFs or Window operations.
>>
>> https://issues.apache.org/jira/browse/SPARK-21190
>>
>>
>> *Proposed API*
>>
>> We introduce a @pandas_udf decorator (or annotation) to define
>> vectorized UDFs which takes one or more pandas.Series or one integer
>> value meaning the length of the input value for 0-parameter UDFs. The
>> return value should be pandas.Series of the specified type and the
>> length of the returned value should be the same as input value.
>>
>> We can define vectorized UDFs as:
>>
>>   @pandas_udf(DoubleType())
>>   def plus(v1, v2):
>>       return v1 + v2
>>
>> or we can define as:
>>
>>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>>
>> We can use it similar to row-by-row UDFs:
>>
>>   df.withColumn('sum', plus(df.v1, df.v2))
>>
>> As for 0-parameter UDFs, we can define and use as:
>>
>>   @pandas_udf(LongType())
>>   def f0(size):
>>       return pd.Series(1).repeat(size)
>>
>>   df.select(f0())
>>
>>
>>
>> The vote will be up for the next 72 hours. Please reply with your vote:
>>
>> +1: Yeah, let's go forward and implement the SPIP.
>> +0: Don't really care.
>> -1: I don't think this is a good idea because of the following technical
>> reasons.
>>
>> Thanks!
>>
>> --
>> Takuya UESHIN
>> Tokyo, Japan
>>
>> http://twitter.com/ueshin
>>
>


-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Posted by Reynold Xin <rx...@databricks.com>.

Is the idea aggregate is out of scope for the current effort and we will be
adding those later?

On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ue...@happy-camper.st> wrote:

> Hi all,
>
> We've been discussing to support vectorized UDFs in Python and we almost
> got a consensus about the APIs, so I'd like to summarize and call for a
> vote.
>
> Note that this vote should focus on APIs for vectorized UDFs, not APIs for
> vectorized UDAFs or Window operations.
>
> https://issues.apache.org/jira/browse/SPARK-21190
>
>
> *Proposed API*
>
> We introduce a @pandas_udf decorator (or annotation) to define vectorized
> UDFs which takes one or more pandas.Series or one integer value meaning
> the length of the input value for 0-parameter UDFs. The return value should
> be pandas.Series of the specified type and the length of the returned
> value should be the same as input value.
>
> We can define vectorized UDFs as:
>
>   @pandas_udf(DoubleType())
>   def plus(v1, v2):
>       return v1 + v2
>
> or we can define as:
>
>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>
> We can use it similar to row-by-row UDFs:
>
>   df.withColumn('sum', plus(df.v1, df.v2))
>
> As for 0-parameter UDFs, we can define and use as:
>
>   @pandas_udf(LongType())
>   def f0(size):
>       return pd.Series(1).repeat(size)
>
>   df.select(f0())
>
>
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical
> reasons.
>
> Thanks!
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> http://twitter.com/ueshin
>