
Posted to dev@spark.apache.org by Hyukjin Kwon <gu...@gmail.com> on 2019/02/09 13:41:10 UTC

Vectorized R gapply[Collect]() implementation

Guys, as a continuation of the Arrow optimization for R DataFrame to Spark
DataFrame conversion,

I am trying to make a vectorized gapply[Collect] implementation as an
experiment, similar to vectorized Pandas UDFs.
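
For readers unfamiliar with the pattern, the following is a rough plain-Python
sketch of a grouped apply that hands each group to the user function as whole
columns, which is the general idea behind "vectorized" here. It is not SparkR
or Arrow code and says nothing about how the PR is implemented; the names
group_columns, gapply_collect and mean_salary are hypothetical and exist only
for this illustration.

```python
from collections import defaultdict


def group_columns(rows, key):
    """Group row dicts by `key`, returning {key_value: {column: [values]}}."""
    groups = defaultdict(lambda: defaultdict(list))
    for row in rows:
        columns = groups[row[key]]
        for name, value in row.items():
            columns[name].append(value)
    # Convert the nested defaultdicts to plain dicts for the caller.
    return {k: dict(v) for k, v in groups.items()}


def gapply_collect(rows, key, func):
    """Call func(key_value, columns) once per group and collect the results."""
    return [func(k, cols) for k, cols in group_columns(rows, key).items()]


def mean_salary(dept, columns):
    # The function sees the group's whole salary column at once.
    salaries = columns["salary"]
    return {"dept": dept, "mean_salary": sum(salaries) / len(salaries)}


rows = [
    {"dept": "a", "salary": 10.0},
    {"dept": "b", "salary": 30.0},
    {"dept": "a", "salary": 20.0},
]

result = gapply_collect(rows, "dept", mean_salary)
print(result)
```

In this sketch the per-group function receives lists of column values rather
than one row at a time, so any per-row overhead is paid once per group instead.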

It brought an 820%+ performance improvement. See
https://github.com/apache/spark/pull/23746

Please come and take a look if you're interested in R APIs :D. I have
already cc'ed some people I know, but please come, review, and discuss on
both the Spark side and the Arrow side.

This Arrow optimization work is being done under
https://issues.apache.org/jira/browse/SPARK-26759 . Please feel free to
take one if any of you are interested in it.

Thanks.

Re: Vectorized R gapply[Collect]() implementation

Posted by Hyukjin Kwon <gu...@gmail.com>.

Thanks guys <3.

FYI, I made PRs for collect and vectorized dapply too.
In my tests, they boost the speed by 1500%+ and 4600%+, respectively.

https://github.com/apache/spark/pull/23760
https://github.com/apache/spark/pull/23787


On Mon, Feb 11, 2019 at 4:45 AM, Felix Cheung <fe...@hotmail.com> wrote:

> This is super awesome!

Re: Vectorized R gapply[Collect]() implementation

Posted by Felix Cheung <fe...@hotmail.com>.

This is super awesome!


________________________________
From: Shivaram Venkataraman <sh...@eecs.berkeley.edu>
Sent: Saturday, February 9, 2019 8:33 AM
To: Hyukjin Kwon
Cc: dev; Felix Cheung; Bryan Cutler; Liang-Chi Hsieh; Shivaram Venkataraman
Subject: Re: Vectorized R gapply[Collect]() implementation

Those speedups look awesome! Great work Hyukjin!

Thanks
Shivaram


Re: Vectorized R gapply[Collect]() implementation

Posted by Shivaram Venkataraman <sh...@eecs.berkeley.edu>.

Those speedups look awesome! Great work Hyukjin!

Thanks
Shivaram

