You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Jeffrey Wong <je...@netflix.com.INVALID> on 2019/01/11 05:54:46 UTC

Using arrow to efficiently transfer data from Python to R

Hello, I wanted to share a great experience I had with arrow, Python and R,
and possible contribute a package I wrote.

I work with many colleagues that build engineering systems in python. As a
data scientist I work almost entirely in R and Rcpp. To bridge the
engineering work with data science work my team uses rpy2. Our workflow is
usually this:

   1. Python is used to coordinate with other engineering systems
   2. Python ultimately gets a dataset that needs to be analyzed. This can
   take the form of parquet or arrow formatted data
   3. Using rpy2, python can pass data to R to be analyzed, and receive
   output from R at the end. Python can then coordinate with downstream systems

What we have found is that passing data from python to R can be quite slow.
Recently, I found that if python already has a pyarrow Table object in the
session, it can be passed to Rcpp as a SEXP through rpy2. Rcpp can use the
C++ arrow library to extract the underlying arrow Table object from the
pyarrow Table object, and materialize a data frame out of that. I have
found that I can transfer 20 million row datasets from python to R in ~10
seconds. This is particularly powerful when Python is already the driver of
the engineering systems, and the compute is pushed into R.

This is a huge advantage. Performance wise, this is the fastest way to
transfer data to R that we have seen. Culture wise, it means my engineering
and data science teams can collaborate much better as both teams operate on
arrow types.

I wrote an R package containing an Rcpp function
RcppReceiveArrowTableFromPython (link
<https://github.com/jeffwong-nflx/RcppPyArrow>) which receives the SEXP
from rpy2, unwraps the underlying arrow table, and then produces an R
dataframe. I am interested in contributing this package back to the Arrow
community, as I believe part of the spirit of Arrow is to facilitate
seamless data transfer across languages. Is the Arrow codebase a proper
home for such a package? This package has a dependency on having Python
headers and being able to link to libpython.so, will that complicate this
contribution?

-- 
Jeffrey Wong
Computational Causal Inference

Re: Using arrow to efficiently transfer data from Python to R

Posted by Wes McKinney <we...@gmail.com>.
It would be very useful to be able to pass objects both directions in
memory "by pointer". I had opened
https://issues.apache.org/jira/browse/ARROW-3750 previously. It may be
that what you've done is already the right way, but feel free to open
a pull request and we can have a look

cheers
Wes

On Fri, Jan 11, 2019 at 2:33 AM Romain Francois <ro...@purrple.cat> wrote:
>
> This is exactly (IMO) the kind of things this project is about. Thanks for looking into that, I'll try to incorporate the idea in the R package asap.
>
> Did you look at the reverse operation, i.e. promote an arrow object from R to python ?
>
> Romain
>
> > Le 11 janv. 2019 à 06:54, Jeffrey Wong <je...@netflix.com.INVALID> a écrit :
> >
> > Hello, I wanted to share a great experience I had with arrow, Python and R,
> > and possible contribute a package I wrote.
> >
> > I work with many colleagues that build engineering systems in python. As a
> > data scientist I work almost entirely in R and Rcpp. To bridge the
> > engineering work with data science work my team uses rpy2. Our workflow is
> > usually this:
> >
> >   1. Python is used to coordinate with other engineering systems
> >   2. Python ultimately gets a dataset that needs to be analyzed. This can
> >   take the form of parquet or arrow formatted data
> >   3. Using rpy2, python can pass data to R to be analyzed, and receive
> >   output from R at the end. Python can then coordinate with downstream systems
> >
> > What we have found is that passing data from python to R can be quite slow.
> > Recently, I found that if python already has a pyarrow Table object in the
> > session, it can be passed to Rcpp as a SEXP through rpy2. Rcpp can use the
> > C++ arrow library to extract the underlying arrow Table object from the
> > pyarrow Table object, and materialize a data frame out of that. I have
> > found that I can transfer 20 million row datasets from python to R in ~10
> > seconds. This is particularly powerful when Python is already the driver of
> > the engineering systems, and the compute is pushed into R.
> >
> > This is a huge advantage. Performance wise, this is the fastest way to
> > transfer data to R that we have seen. Culture wise, it means my engineering
> > and data science teams can collaborate much better as both teams operate on
> > arrow types.
> >
> > I wrote an R package containing an Rcpp function
> > RcppReceiveArrowTableFromPython (link
> > <https://github.com/jeffwong-nflx/RcppPyArrow>) which receives the SEXP
> > from rpy2, unwraps the underlying arrow table, and then produces an R
> > dataframe. I am interested in contributing this package back to the Arrow
> > community, as I believe part of the spirit of Arrow is to facilitate
> > seamless data transfer across languages. Is the Arrow codebase a proper
> > home for such a package? This package has a dependency on having Python
> > headers and being able to link to libpython.so, will that complicate this
> > contribution?
> >
> > --
> > Jeffrey Wong
> > Computational Causal Inference
>

Re: Using arrow to efficiently transfer data from Python to R

Posted by Romain Francois <ro...@purrple.cat>.
This is exactly (IMO) the kind of things this project is about. Thanks for looking into that, I'll try to incorporate the idea in the R package asap. 

Did you look at the reverse operation, i.e. promote an arrow object from R to python ?

Romain

> Le 11 janv. 2019 à 06:54, Jeffrey Wong <je...@netflix.com.INVALID> a écrit :
> 
> Hello, I wanted to share a great experience I had with arrow, Python and R,
> and possible contribute a package I wrote.
> 
> I work with many colleagues that build engineering systems in python. As a
> data scientist I work almost entirely in R and Rcpp. To bridge the
> engineering work with data science work my team uses rpy2. Our workflow is
> usually this:
> 
>   1. Python is used to coordinate with other engineering systems
>   2. Python ultimately gets a dataset that needs to be analyzed. This can
>   take the form of parquet or arrow formatted data
>   3. Using rpy2, python can pass data to R to be analyzed, and receive
>   output from R at the end. Python can then coordinate with downstream systems
> 
> What we have found is that passing data from python to R can be quite slow.
> Recently, I found that if python already has a pyarrow Table object in the
> session, it can be passed to Rcpp as a SEXP through rpy2. Rcpp can use the
> C++ arrow library to extract the underlying arrow Table object from the
> pyarrow Table object, and materialize a data frame out of that. I have
> found that I can transfer 20 million row datasets from python to R in ~10
> seconds. This is particularly powerful when Python is already the driver of
> the engineering systems, and the compute is pushed into R.
> 
> This is a huge advantage. Performance wise, this is the fastest way to
> transfer data to R that we have seen. Culture wise, it means my engineering
> and data science teams can collaborate much better as both teams operate on
> arrow types.
> 
> I wrote an R package containing an Rcpp function
> RcppReceiveArrowTableFromPython (link
> <https://github.com/jeffwong-nflx/RcppPyArrow>) which receives the SEXP
> from rpy2, unwraps the underlying arrow table, and then produces an R
> dataframe. I am interested in contributing this package back to the Arrow
> community, as I believe part of the spirit of Arrow is to facilitate
> seamless data transfer across languages. Is the Arrow codebase a proper
> home for such a package? This package has a dependency on having Python
> headers and being able to link to libpython.so, will that complicate this
> contribution?
> 
> -- 
> Jeffrey Wong
> Computational Causal Inference