Posted to dev@arrow.apache.org by Dean Chen <de...@dv01.co> on 2017/05/14 17:26:16 UTC

Improve SparkR collect performance with Arrow

Following up on the discussion in
https://issues.apache.org/jira/browse/SPARK-18924: we have internal use
cases that would benefit significantly from improved collect performance,
and we would like to kick off a proposal/effort for SparkR similar to
https://issues.apache.org/jira/browse/SPARK-13534.

Complex data types introduced additional complexity to SPARK-13534, and
they are not a requirement for us, so we are thinking the initial proposal
would cover simple types only, falling back on the current implementation
for complex types.

Integration would involve introducing a flag that enables the Arrow
serialization logic in *collect* (
https://github.com/apache/spark/blob/branch-2.2/R/pkg/R/DataFrame.R#L1129),
which would call an Arrow implementation of *dfToCols* (
https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L211)
that returns Arrow byte arrays.
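
To make the shape of that concrete, here is a rough Scala sketch (not
actual Spark code) of the serialization half: one integer column written
out in Arrow's streaming format as a byte array, which is roughly the
payload an Arrow-backed dfToCols could hand back to the R process. The
object and method names are hypothetical, and the class names follow
recent arrow-vector Java releases, so the 0.3-era API may differ.

import java.io.ByteArrayOutputStream

import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.{IntVector, VectorSchemaRoot}
import org.apache.arrow.vector.ipc.ArrowStreamWriter

// Hypothetical helper, not part of Spark: serialize one int column
// to the Arrow streaming format.
object ArrowCollectSketch {
  def intColumnToArrowBytes(name: String, values: Array[Int]): Array[Byte] = {
    val allocator = new RootAllocator(Long.MaxValue)
    val vector = new IntVector(name, allocator)
    vector.allocateNew(values.length)
    values.zipWithIndex.foreach { case (v, i) => vector.setSafe(i, v) }
    vector.setValueCount(values.length)

    val root = VectorSchemaRoot.of(vector)
    val out = new ByteArrayOutputStream()
    val writer = new ArrowStreamWriter(root, null, out) // no dictionaries
    writer.start()       // writes the schema message
    writer.writeBatch()  // writes one record batch
    writer.end()         // writes the end-of-stream marker
    writer.close()
    root.close()
    allocator.close()
    out.toByteArray      // what the R side would receive and decode
  }
}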

It looks like https://github.com/wesm/feather hasn't been updated since the
Arrow 0.3 release, so I'm assuming it would have to be updated to enable
converting the byte arrays from dfToCols to R data frames? Wes also brought
up a unified serialization implementation for Spark/Scala, R, and Python
that would enable easy sharing of IO optimizations.
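
For the R-side conversion, the binding would need the mirror image of the
sketch above: parse the stream bytes back into columns. Here is the same
logic sketched in Scala against the Java library purely for illustration;
the updated feather/Rcpp code would do the equivalent against the Arrow
C++ library.

import java.io.ByteArrayInputStream

import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.IntVector
import org.apache.arrow.vector.ipc.ArrowStreamReader

// Illustrative reader for the single-int-column stream produced above.
object ArrowReadSketch {
  def arrowBytesToInts(bytes: Array[Byte]): Array[Int] = {
    val allocator = new RootAllocator(Long.MaxValue)
    val reader = new ArrowStreamReader(new ByteArrayInputStream(bytes), allocator)
    val root = reader.getVectorSchemaRoot  // schema comes first in the stream
    val result = Array.newBuilder[Int]
    while (reader.loadNextBatch()) {       // false once the stream ends
      val vector = root.getVector(0).asInstanceOf[IntVector]
      (0 until root.getRowCount).foreach(i => result += vector.get(i))
    }
    reader.close()
    allocator.close()
    result.result()
  }
}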

Please let us know your thoughts/opinions on the above and the preferred
way of collaborating with the Arrow community on this.
-- 
VP of Engineering - dv01, Featured in Forbes Fintech 50 For 2016
<http://www.forbes.com/fintech/2016/#310668d56680>
915 Broadway | Suite 502 | New York, NY 10010
(646)-838-2310
dean@dv01.co | www.dv01.co

Re: Improve SparkR collect performance with Arrow

Posted by Felix Cheung <fe...@hotmail.com>.
I can try to help.


_____________________________
From: Wes McKinney <we...@gmail.com>
Sent: Monday, May 15, 2017 12:49 PM
Subject: Re: Improve SparkR collect performance with Arrow
To: Dirk Eddelbuettel <di...@eddelbuettel.com>, <de...@arrow.apache.org>, Jim Hester <ja...@rstudio.com>, Hadley Wickham <ha...@rstudio.com>, Kevin Ushey <ke...@rstudio.com>


[The forwarded message is Wes McKinney's reply, which appears in full as
the next entry below.]



Re: Improve SparkR collect performance with Arrow

Posted by Wes McKinney <we...@gmail.com>.
Adding Hadley and others to the conversation to advise on the best path forward.

I am happy to help with maintenance of the C++ code. For example, if
there are API changes that affect the Rcpp bindings, I would help fix
them. We have GLib-based C bindings and Cython bindings (Cython being
roughly Python's equivalent of Rcpp), so adding another binding layer
to the mix is no problem.

I am eager to do work that benefits the R community, so hopefully among
all of us we can find a division of labor that will advance this effort.

Thanks
Wes


Re: Improve SparkR collect performance with Arrow

Posted by Dean Chen <de...@dv01.co>.
Hi Wes,

We can work with the Spark community on the Spark/SparkR integration.

Also happy to help with migrating the R package from Feather in to Arrow.

Do you have anyone in mind to manage the R/Rcpp binding issues? I reviewed
the R and C++ files in https://github.com/wesm/feather/tree/master/R and we
may be able to take a first pass at it to get things off the ground. We
would still want an Rcpp expert to review and own it, since we're not
experts with Rcpp and I'm sure it's riddled with lots of caveats like any
other fdw.

We maintain lots of R packages internally and can help with, or take the
lead on, R packaging/builds/testing in Travis for the Arrow project.


Re: Improve SparkR collect performance with Arrow

Posted by Wes McKinney <we...@gmail.com>.
Note that I just opened https://github.com/wesm/feather/pull/297, which
deletes all of the Feather Python code (making pyarrow a dependency
instead).


Re: Improve SparkR collect performance with Arrow

Posted by Wes McKinney <we...@gmail.com>.
hi Dean,

In Arrow 0.3 we incorporated the C++ and Python code from wesm/feather into
the Arrow repo. The Feather format is a simplified version of the Arrow IPC
format (which has file/batch and stream flavors), so the ideal approach
would be to move the Feather R/Rcpp wrapper code into the Arrow codebase
and generalize it to support the Arrow streams that are coming from Spark
(as in SPARK-13534).
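
To illustrate the two flavors, here is a rough sketch in Scala against
the Java library (class names from recent releases, purely illustrative;
the C++ API the R binding would wrap is analogous): the schema and record
batches are identical in both, and only the framing differs.

import java.io.{ByteArrayOutputStream, FileOutputStream}

import org.apache.arrow.vector.VectorSchemaRoot
import org.apache.arrow.vector.ipc.{ArrowFileWriter, ArrowStreamWriter}

// Sketch only: write the same root out in both IPC framings.
object IpcFlavorsSketch {
  def writeBoth(root: VectorSchemaRoot): Unit = {
    // File format: a footer indexes the batches for random access
    // (roughly the flavor that Feather simplifies).
    val channel = new FileOutputStream("batches.arrow").getChannel
    val fileWriter = new ArrowFileWriter(root, null, channel)
    fileWriter.start(); fileWriter.writeBatch(); fileWriter.end(); fileWriter.close()

    // Stream format: batches written in order, suited to pipes and
    // sockets -- the flavor Spark would send to R.
    val out = new ByteArrayOutputStream()
    val streamWriter = new ArrowStreamWriter(root, null, out)
    streamWriter.start(); streamWriter.writeBatch(); streamWriter.end(); streamWriter.close()
  }
}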

Adding support for nested types should also be possible -- we have
implemented more of the converters for them on the Python side. The Feather
format doesn't support nested types, so we would want to deprecate that
format as soon as practical (Feather has plenty of users; and we can always
maintain the library(feather) import and associated R API).

In any case, this seems like an ideal collaboration for the Spark and Arrow
communities; what is missing is an experienced developer from the R
community who can manage the R/Rcpp binding issues (I can help some with
maintaining the C++ side of the bindings) and address packaging / builds /
continuous integration.

- Wes
