Posted to dev@arrow.apache.org by Clark Fitzgerald <cl...@gmail.com> on 2017/07/19 17:44:13 UTC

Use case for R Arrow Bindings

Hello all,

I saw the notes come through from today's call:

> * R Arrow Bindings?
>  - Find use cases within the R community, contributors needed
>  - R Feather bindings a useful starting point

This year I've been working on parallel R on datasets in the 100+ GB range,
and have found that loading and saving data from text files is a real
bottleneck. Another consideration is breaking the data up into chunks for
parallel processing while maintaining metadata and overall structure. So
I've been watching Parquet and Arrow.

Specifically here are two use cases in R where Arrow / Parquet could be
helpful:

- Splitting up a large data set into pieces which fit comfortably in memory
then applying normal R functions to each piece. Basically GROUP BY.
- Matloff's Software Alchemy, statistical averaging based on independent
chunks of data. This requires rows to be randomly assigned to chunks.
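
To make the two use cases above concrete, here is a minimal Python sketch
(standard library only). It is an illustration, not Arrow or Parquet code:
the helper names split_by_key, random_chunks and chunk_average are
hypothetical, the rows are held in memory as plain dicts, and the chunk
averaging is only in the spirit of Software Alchemy (apply an estimator to
each randomly assigned chunk, then average the per-chunk results). In
practice each piece would presumably live in its own file and be handled by
a separate worker.

```python
import random
from collections import defaultdict
from statistics import fmean


def split_by_key(rows, key):
    """Group rows (dicts) by the value stored under `key`, like a GROUP BY."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def random_chunks(rows, n_chunks, seed=0):
    """Randomly assign rows to `n_chunks` chunks of near-equal size."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Striding over the shuffled list gives every row exactly one chunk.
    return [shuffled[i::n_chunks] for i in range(n_chunks)]


def chunk_average(rows, n_chunks, estimator, seed=0):
    """Apply `estimator` to each random chunk and average the results.

    Assumes n_chunks <= len(rows), so that no chunk is empty.
    """
    chunks = random_chunks(rows, n_chunks, seed)
    return fmean(estimator(chunk) for chunk in chunks)


# Toy data: 1000 rows, two groups, one numeric column.
rows = [{"group": "a" if i % 2 else "b", "x": float(i)} for i in range(1000)]

# Use case 1: split by group, then apply an ordinary function to each piece.
per_group_mean = {
    g: fmean(r["x"] for r in piece)
    for g, piece in split_by_key(rows, "group").items()
}

# Use case 2: average an estimator over randomly assigned chunks.
estimate = chunk_average(
    rows, n_chunks=4, estimator=lambda chunk: fmean(r["x"] for r in chunk)
)
```

With equal-sized chunks and the mean as the estimator, the chunk average
matches the overall mean; for other estimators it is only an approximation,
which is why the random assignment of rows to chunks matters.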

Another option besides starting from the R Feather bindings is to start
with an automatically generated set of bindings:
https://github.com/duncantl/RCodeGen

Best,
Clark Fitzgerald

Re: Use case for R Arrow Bindings

Posted by Clark Fitzgerald <cl...@gmail.com>.

Great, I'll be on the call. The first steps I took today with the
automatically generated bindings from the C++ source seem promising. Much
more work is required to make it usable though.

On Mon, Jul 24, 2017 at 9:00 PM, Kevin Moore <ke...@quiltdata.io> wrote:

> A group of Quilt users and team members interested in R is planning a short
> call to get the ball rolling on R bindings for Arrow (and Quilt) tomorrow
> at 4PM Pacific. We'd love to have anyone who's interested from this list
> join us in the hangout:
> https://hangouts.google.com/hangouts/_/quiltdata.io/aneesh?authuser=1
>
> Thanks,
>
> Kevin

Re: Use case for R Arrow Bindings

Posted by Kevin Moore <ke...@quiltdata.io>.

A group of Quilt users and team members interested in R is planning a short
call to get the ball rolling on R bindings for Arrow (and Quilt) tomorrow
at 4PM Pacific. We'd love to have anyone who's interested from this list
join us in the hangout:
https://hangouts.google.com/hangouts/_/quiltdata.io/aneesh?authuser=1

Thanks,

Kevin

----
Kevin Moore
CEO, Quilt Data, Inc.
kevin@quiltdata.io | LinkedIn <https://www.linkedin.com/in/kevinemoore/>
(415) 497-7895


Manage Data like Code
quiltdata.com

On Mon, Jul 24, 2017 at 7:58 AM, Wes McKinney <we...@gmail.com> wrote:

> + Hadley

Re: Use case for R Arrow Bindings

Posted by Wes McKinney <we...@gmail.com>.

+ Hadley

On Fri, Jul 21, 2017 at 2:04 PM, Bryan Cutler <cu...@gmail.com> wrote:
> Thanks Clark.  I know that SparkR would benefit a lot from Arrow bindings
> and many people would like to see that, but to my knowledge no one has
> started working on this yet.  Please keep us updated with what you find!
>
> Bryan

Re: Use case for R Arrow Bindings

Posted by Bryan Cutler <cu...@gmail.com>.

Thanks Clark.  I know that SparkR would benefit a lot from Arrow bindings
and many people would like to see that, but to my knowledge no one has
started working on this yet.  Please keep us updated with what you find!

Bryan

On Fri, Jul 21, 2017 at 9:15 AM, Clark Fitzgerald <cl...@gmail.com>
wrote:

> Regarding the R Consortium, the Distributed Computing Working Group led by
> Michael Lawrence would be interested in this. It would be nice to go to
> them with some working examples and use cases.
>
> Next week I will start looking into R / Arrow bindings. A couple other
> people at the UC Davis Data Science Initiative have expressed interest as
> well. I'll post updates here.

Re: Use case for R Arrow Bindings

Posted by Clark Fitzgerald <cl...@gmail.com>.

Regarding the R Consortium, the Distributed Computing Working Group led by
Michael Lawrence would be interested in this. It would be nice to go to
them with some working examples and use cases.

Next week I will start looking into R / Arrow bindings. A couple other
people at the UC Davis Data Science Initiative have expressed interest as
well. I'll post updates here.

On Wed, Jul 19, 2017 at 5:01 PM, Dean Chen <de...@dv01.co> wrote:

> Sounds good, will get a thread going there.

Re: Use case for R Arrow Bindings

Posted by Dean Chen <de...@dv01.co>.

Sounds good, will get a thread going there.

On Wed, Jul 19, 2017 at 6:02 PM Wes McKinney <we...@gmail.com> wrote:

> Especially with Arrow support landing in Spark (SPARK-13534), it would
> be helpful to combine efforts between Python and R on this front. I
> also have a long list of improvements to the Feather format that will
> be substantially simpler once library(feather) is depending on the
> main Arrow libraries.
>
> I suggest you reach out to members of the R community directly on
> public forums about development help / advice and soliciting
> collaboration. There are other R venues where you can describe your
> use cases, like the R Consortium and its subcommittees:
> https://www.r-consortium.org/. I would go directly to the mailing
> lists and see if there is anyone who would like to get involved. It's
> more likely that you'll get attention on this problem in the R mailing
> lists than on the Arrow mailing list due to the chicken-and-egg
> aspect.
>
> As a side note, my opinion is that shared storage, memory formats, and
> computing libraries (e.g. native C++ libraries targeting Arrow memory)
> are going to be more and more important to the R / Python / Julia
> communities (and beyond -- Kou has been developing Arrow interfaces
> for Ruby, which has not traditionally had a large data science
> community) as time passes. I would like to personally do more on the R
> side but I simply don't have the bandwidth to take responsibility for
> another major component, especially not in an unfamiliar software
> development stack.
>
> Let me know how I can help, and if there are R mailing list
> discussions where we (the Arrow developers) can chime in please alert
> us to them here.
>
> - Wes
>
> On Wed, Jul 19, 2017 at 5:29 PM, Dean Chen <de...@dv01.co> wrote:
> > I also sent a note about it to the dev list a month ago. Still have a huge
> > internal need and interested in helping push this along where we can.
> > Unfortunately, our team is more focused around Spark and doesn't have much
> > experience working with the R community.
> >
> > On Wed, Jul 19, 2017 at 1:44 PM Clark Fitzgerald <cl...@gmail.com>
> > wrote:
> >
> >> Hello all,
> >>
> >> I saw the notes come through from today's call:
> >>
> >> > * R Arrow Bindings?
> >> >  - Find use cases within the R community, contributors needed
> >> >  - R Feather bindings a useful starting point
> >>
> >> This year I've been working on parallel R on datasets in the 100+ GB range,
> >> and have found that loading and saving data from text files is a real
> >> bottleneck. Another consideration is breaking the data up into chunks for
> >> parallel processing while maintaining metadata and overall structure. So
> >> I've been watching Parquet and Arrow.
> >>
> >> Specifically here are two use cases in R where Arrow / Parquet could be
> >> helpful:
> >>
> >> - Splitting up a large data set into pieces which fit comfortably in memory
> >> then applying normal R functions to each piece. Basically GROUP BY.
> >> - Matloff's Software Alchemy, statistical averaging based on independent
> >> chunks of data. This requires rows to be randomly assigned to chunks.
> >>
> >> Another option besides starting from the R Feather bindings is to start
> >> with an automatically generated set of bindings:
> >> https://github.com/duncantl/RCodeGen
> >>
> >> Best,
> >> Clark Fitzgerald
> >>
> > --
> > VP of Engineering - dv01, Featured in Forbes Fintech 50 For 2016
> > <http://www.forbes.com/fintech/2016/#310668d56680>
> > 915 Broadway | Suite 502 | New York, NY 10010
> > (646)-838-2310
> > dean@dv01.co | www.dv01.co
>
-- 
VP of Engineering - dv01, Featured in Forbes Fintech 50 For 2016
<http://www.forbes.com/fintech/2016/#310668d56680>
915 Broadway | Suite 502 | New York, NY 10010
(646)-838-2310
dean@dv01.co | www.dv01.co

Re: Use case for R Arrow Bindings

Posted by Wes McKinney <we...@gmail.com>.

Especially with Arrow support landing in Spark (SPARK-13534), it would
be helpful to combine efforts between Python and R on this front. I
also have a long list of improvements to the Feather format that will
be substantially simpler once library(feather) is depending on the
main Arrow libraries.

I suggest you reach out to members of the R community directly on
public forums about development help / advice and soliciting
collaboration. There are other R venues where you can describe your
use cases, like the R Consortium and its subcommittees:
https://www.r-consortium.org/. I would go directly to the mailing
lists and see if there is anyone who would like to get involved. It's
more likely that you'll get attention on this problem in the R mailing
lists than on the Arrow mailing list due to the chicken-and-egg
aspect.

As a side note, my opinion is that shared storage, memory formats, and
computing libraries (e.g. native C++ libraries targeting Arrow memory)
are going to be more and more important to the R / Python / Julia
communities (and beyond -- Kou has been developing Arrow interfaces
for Ruby, which has not traditionally had a large data science
community) as time passes. I would like to personally do more on the R
side but I simply don't have the bandwidth to take responsibility for
another major component, especially not in an unfamiliar software
development stack.

Let me know how I can help, and if there are R mailing list
discussions where we (the Arrow developers) can chime in please alert
us to them here.

- Wes

On Wed, Jul 19, 2017 at 5:29 PM, Dean Chen <de...@dv01.co> wrote:
> I also sent a note about it to the dev list a month ago. Still have a huge
> internal need and interested in helping push this along where we can.
> Unfortunately, our team is more focused around Spark and doesn't have much
> experience working with the R community.
>
> On Wed, Jul 19, 2017 at 1:44 PM Clark Fitzgerald <cl...@gmail.com>
> wrote:
>
>> Hello all,
>>
>> I saw the notes come through from today's call:
>>
>> > * R Arrow Bindings?
>> >  - Find use cases within the R community, contributors needed
>> >  - R Feather bindings a useful starting point
>>
>> This year I've been working on parallel R on datasets in the 100+ GB range,
>> and have found that loading and saving data from text files is a real
>> bottleneck. Another consideration is breaking the data up into chunks for
>> parallel processing while maintaining metadata and overall structure. So
>> I've been watching Parquet and Arrow.
>>
>> Specifically here are two use cases in R where Arrow / Parquet could be
>> helpful:
>>
>> - Splitting up a large data set into pieces which fit comfortably in memory
>> then applying normal R functions to each piece. Basically GROUP BY.
>> - Matloff's Software Alchemy, statistical averaging based on independent
>> chunks of data. This requires rows to be randomly assigned to chunks.
>>
>> Another option besides starting from the R Feather bindings is to start
>> with an automatically generated set of bindings:
>> https://github.com/duncantl/RCodeGen
>>
>> Best,
>> Clark Fitzgerald
>>
> --
> VP of Engineering - dv01, Featured in Forbes Fintech 50 For 2016
> <http://www.forbes.com/fintech/2016/#310668d56680>
> 915 Broadway | Suite 502 | New York, NY 10010
> (646)-838-2310
> dean@dv01.co | www.dv01.co

Re: Use case for R Arrow Bindings

Posted by Dean Chen <de...@dv01.co>.

I also sent a note about it to the dev list a month ago. We still have a huge
internal need and are interested in helping push this along where we can.
Unfortunately, our team is more focused on Spark and doesn't have much
experience working with the R community.

On Wed, Jul 19, 2017 at 1:44 PM Clark Fitzgerald <cl...@gmail.com>
wrote:

> Hello all,
>
> I saw the notes come through from today's call:
>
> > * R Arrow Bindings?
> >  - Find use cases within the R community, contributors needed
> >  - R Feather bindings a useful starting point
>
> This year I've been working on parallel R on datasets in the 100+ GB range,
> and have found that loading and saving data from text files is a real
> bottleneck. Another consideration is breaking the data up into chunks for
> parallel processing while maintaining metadata and overall structure. So
> I've been watching Parquet and Arrow.
>
> Specifically here are two use cases in R where Arrow / Parquet could be
> helpful:
>
> - Splitting up a large data set into pieces which fit comfortably in memory
> then applying normal R functions to each piece. Basically GROUP BY.
> - Matloff's Software Alchemy, statistical averaging based on independent
> chunks of data. This requires rows to be randomly assigned to chunks.
>
> Another option besides starting from the R Feather bindings is to start
> with an automatically generated set of bindings:
> https://github.com/duncantl/RCodeGen
>
> Best,
> Clark Fitzgerald
>
-- 
VP of Engineering - dv01, Featured in Forbes Fintech 50 For 2016
<http://www.forbes.com/fintech/2016/#310668d56680>
915 Broadway | Suite 502 | New York, NY 10010
(646)-838-2310
dean@dv01.co | www.dv01.co