Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2018/12/04 03:03:06 UTC

Re: Arrow and R benchmark

hi Jonathan,
On Sat, Nov 24, 2018 at 6:19 PM Jonathan Chiang <ch...@gmail.com> wrote:
>
> Hi Wes and Romain,
>
> I wrote a preliminary benchmark for reading and writing different file types from R into Arrow, borrowing some code from Hadley. I would like some feedback to improve it and then possibly push an r/benchmarks folder. I am willing to dedicate most of next week to this project, as I am taking a vacation from work, and would like to contribute to Arrow and R.
>
> To Romain: What is the difference in R when using tibble versus reading from arrow?
> Is the general advantage that you can serialize the data to Arrow when saving it, and then be able to read it in Python with Arrow and then pandas?

Arrow has a language-independent binary protocol for data interchange
that does not require deserialization of data on read. It can be read
or written in many different ways: files, sockets, shared memory, etc.
How it gets used depends on the application.
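
For example, here is a minimal sketch in Python with pyarrow of writing a
table in the Arrow IPC file format and memory-mapping it back without a
deserialization step. The file name and columns are made up for
illustration, and the API names are from recent pyarrow releases:

    import pyarrow as pa

    # Build a small in-memory table (illustrative data only)
    table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

    # Write it out in the Arrow IPC file format
    with pa.OSFile("example.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    # Memory-map the file and read it back; the columnar buffers are
    # referenced in place rather than deserialized
    with pa.memory_map("example.arrow", "r") as source:
        loaded = pa.ipc.open_file(source).read_all()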

>
> General Roadmap Question to Wes and Romain :
> My vision for the future of data science is the ability to serialize data securely and pass data and models, with some form of authentication, between IDEs over secure ports. This idea would build on something similar to gRPC, with more security designed around sharing data. I noticed Flight uses gRPC.
>

Correct, our plan for RPC is to use gRPC for secure transport of
components of the Arrow columnar protocol. We'd love to have more
developers involved with this effort.
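
As a rough sketch of what that could look like from Python (Flight is
still under active development, so the API below is an assumption based on
later pyarrow releases; the server location and ticket contents are made
up):

    import pyarrow.flight as flight

    # Connect to a hypothetical Flight server
    client = flight.connect("grpc://localhost:8815")

    # Request a stream by ticket; ticket contents are application-defined
    reader = client.do_get(flight.Ticket(b"example-dataset"))

    # Read the stream of record batches into a single table
    table = reader.read_all()
    print(table.num_rows)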

> Also, I was interested in whether there was any momentum in the R community to serialize models, similar to the work of ONNX, into a unified model storage system. The idea is to have a secure, reproducible environment for R and Python developer groups to readily share models and data, with the caveat that data sent also has added security and possibly a history associated with it. This piece of work is something I am passionate about seeing come to fruition, and I would like to explore options for making it happen.
>

Here we are focused on efficient handling and processing of datasets.
These tools could be used to build a model storage system if so
desired.

> The background for me is enabling healthcare teams to share medical data securely among different analytics teams. The security provisions would enable more robust cloud-based storage and computation in a secure fashion.
>

I would like to see deeper integration with cloud storage services in
2019 in the core C++ libraries, which would be made available in R,
Python, Ruby, etc.

- Wes

> Thanks,
> Jonathan
>
>
>
> Side Note:
> Building Arrow for R on Linux was a big hassle relative to macOS. I was unable to build on Linux.
>
>
>
>
> On Thu, Nov 15, 2018 at 7:50 PM Jonathan Chiang <ch...@gmail.com> wrote:
>>
>> I'll go through that python repo and see what I can do.
>>
>> Thanks,
>> Jonathan
>>
>> On Thu, Nov 15, 2018 at 1:55 PM Wes McKinney <we...@gmail.com> wrote:
>>>
>>> I would suggest starting an r/benchmarks directory like we have in
>>> Python (https://github.com/apache/arrow/tree/master/python/benchmarks)
>>> and documenting the process for running all the benchmarks.
>>> On Thu, Nov 15, 2018 at 4:52 PM Romain François <ro...@purrple.cat> wrote:
>>> >
>>> > Right now, most of the code examples are in the unit tests, but these are not measuring performance or stressing it. Perhaps you can start from there?
>>> >
>>> > Romain
>>> >
>>> > > Le 15 nov. 2018 à 22:16, Wes McKinney <we...@gmail.com> a écrit :
>>> > >
>>> > > Adding dev@arrow.apache.org
>>> > >> On Thu, Nov 15, 2018 at 4:13 PM Jonathan Chiang <ch...@gmail.com> wrote:
>>> > >>
>>> > >> Hi,
>>> > >>
>>> > >> I would like to contribute to developing benchmark suites for R and Arrow. What would be the best way to start?
>>> > >>
>>> > >> Thanks,
>>> > >> Jonathan
>>> >
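
As a concrete starting point for the r/benchmarks idea quoted above, a
benchmark script could simply time write/read round-trips through the
formats being compared. Below is a minimal sketch in Python with pyarrow
(the R package would do the analogous thing); the data shape and repeat
count are arbitrary, and the API names are from recent pyarrow releases:

    import time
    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.parquet as pq

    # Synthetic table; a real benchmark would use larger, more varied data
    n = 1_000_000
    table = pa.table({"x": list(range(n)),
                      "y": [float(i) for i in range(n)]})

    def timed(fn):
        start = time.perf_counter()
        fn()
        return time.perf_counter() - start

    def bench(label, write, read, repeats=3):
        # Keep the best of a few runs for write and read separately
        best_write = min(timed(write) for _ in range(repeats))
        best_read = min(timed(read) for _ in range(repeats))
        print(f"{label}: write {best_write:.3f}s, read {best_read:.3f}s")

    bench("feather",
          lambda: feather.write_feather(table, "bench.feather"),
          lambda: feather.read_table("bench.feather"))
    bench("parquet",
          lambda: pq.write_table(table, "bench.parquet"),
          lambda: pq.read_table("bench.parquet"))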

Re: Arrow and R benchmark

Posted by Jonathan Chiang <ch...@gmail.com>.
Hi All,

The thread about Apache Arrow and Google Cloud support did get some
traction! Thanks to Micah for his suggestions.

If everyone can star this link, we could get more visibility. I'm
guessing if Wes responds to the thread it would be a huge win.

https://issuetracker.google.com/issues/124858094

Thanks,
Jonathan

On Tue, Feb 26, 2019 at 7:55 PM Wes McKinney <we...@gmail.com> wrote:

> Thanks Micah for the update.
>
> The continued investment in Apache Avro is interesting given the
> low-activity state of that community. I'm optimistic that BQ will
> offer native Arrow export at some point in the future, perhaps after
> we reach a "1.0.0" release
>
> - Wes
>
>
> On Sat, Feb 23, 2019 at 12:17 PM Jonathan Chiang <ch...@gmail.com>
> wrote:
> >
> > Hi Micah,
> >
> > Yes I filed the feature request from your advice. I will look more into
> avro for my own bigquery use cases. Thanks for following up.
> >
> > Best,
> > Jonathan
> >
> > On Feb 22, 2019, at 8:35 PM, Micah Kornfield <em...@gmail.com>
> wrote:
> >
> > Just to follow up on this thread, a new high throughput API [1] for
> reading data out of big query was released to public beta today.  The
> format it streams is AVRO so it should be higher performance then parsing
> JSON (and reads can be parallelized).  Implementing AVRO reading was
> something I was going to start working on in the next week or so, and I'll
> probably continue on to add support to arrow C++ for the new API (I will be
> creating JIRAs soon).  Given my current bandwidth (I contribute to arrow on
> my free time), this will take a while.  So if people are interested in
> collaborating (or taking this over) please let me know.
> >
> > Also, it looks like someone took my advice and filed a feature request
> [2] for surfacing apache arrow natively.
> >
> > Thanks,
> > Micah
> >
> > [1] https://cloud.google.com/bigquery/docs/reference/storage/
> > [2] https://issuetracker.google.com/issues/124858094
> >
> > On Wed, Feb 13, 2019 at 1:25 PM Wes McKinney <we...@gmail.com>
> wrote:
> >>
> >> Would someone like to make some feature requests to Google or engage
> >> with them in another way? I have interacted with GCP in the past; I
> >> think it would be helpful for them to hear from other Arrow users or
> >> community members since I have been quite public as a carrier of the
> >> Arrow banner.
> >>
> >> On Tue, Feb 5, 2019 at 12:11 AM Micah Kornfield <em...@gmail.com>
> wrote:
> >> >
> >> > Disclaimer: I work for Google (not on BQ).  Everything I'm going to
> write
> >> > reflects my own opinions, not those of my company.
> >> >
> >> > Jonathan and Wes,
> >> >
> >> > One way of trying to get support for this is filing a feature request
> at
> >> > [1] and getting broader customer support for it.  Another possible
> way of
> >> > gaining broader exposure within Google is collaborating with other
> open
> >> > source projects that it contributes to.  For instance there was a
> >> > conversation recently about the potential use of Arrow on the Apache
> Beam
> >> > mailing list [2].  I will try to post a link to this thread
> internally, but
> >> > I can't make any promises and likely not give any updates on progress.
> >> >
> >> > This is also very much my own opinion, but I think in order to expose
> Arrow
> >> > in a public API it would be nice to reach a stable major release (i.e.
> >> > 1.0.0) and ensure Arrow properly supports big query data-types
> >> > appropriately [3], (I think it mostly does but date/time might be an
> issue).
> >> >
> >> > [1]
> >> >
> https://cloud.google.com/support/docs/issue-trackers#search_for_or_create_bugs_and_feature_requests_by_product
> >> > [2]
> >> >
> https://lists.apache.org/thread.html/32cbbe587016cd0ac9e1f7b1de457b0bd69936c88dfdc734ffa366db@%3Cdev.beam.apache.org%3E
> >> > [3]
> https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
> >> >
> >> >
> >> > On Monday, February 4, 2019, Wes McKinney <we...@gmail.com>
> wrote:
> >> >
> >> > > Arrow support would be an obvious win for BigQuery. I've spoken with
> >> > > people at Google Cloud about this in several occasions.
> >> > >
> >> > > With the gRPC / Flight work coming along it might be a good
> >> > > opportunity to rekindle the discussion. If anyone from GCP is
> reading
> >> > > or if you know anyone at GCP who might be able to work with us I
> would
> >> > > be very interested.
> >> > >
> >> > > One hurdle for BigQuery is that my understanding is that Google has
> >> > > policies in place that make it more difficult to take on external
> >> > > library dependencies in a sensitive system like Dremel / BigQuery.
> So
> >> > > someone from Google might have to develop an in-house Arrow
> >> > > implementation sufficient to send Arrow datasets from BigQuery to
> >> > > clients. The scope of that project is small enough (requiring only
> >> > > Flatbuffers as a dependency) that a motivated C or C++ developer at
> >> > > Google ought to be able to get it done in a month or two of focused
> >> > > work.
> >> > >
> >> > > - Wes
> >> > >
> >> > > On Mon, Feb 4, 2019 at 4:40 PM Jonathan Chiang <chiang810@gmail.com
> >
> >> > > wrote:
> >> > > >
> >> > > > Hi Wes,
> >> > > >
> >> > > > I am currently working a lot with Google BigQuery in R and Python.
> >> > > Hadley Wickham listed this as a big bottleneck for his library
> bigrquery.
> >> > > >
> >> > > > The bottleneck for loading BigQuery data is now parsing
> BigQuery’s JSON
> >> > > format, which is difficult to optimise further because I’m already
> using
> >> > > the fastest C++ JSON parser, RapidJson. If this is still too slow
> (because
> >> > > you download a lot of data), see ?bq_table_download for an
> alternative
> >> > > approach.
> >> > > >
> >> > > > Is there any momentum for Arrow to partner with Google here?
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > Jonathan
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Mon, Dec 3, 2018 at 7:03 PM Wes McKinney <we...@gmail.com>
> wrote:
> >> > > >>
> >> > > >> hi Jonathan,
> >> > > >> On Sat, Nov 24, 2018 at 6:19 PM Jonathan Chiang <
> chiang810@gmail.com>
> >> > > wrote:
> >> > > >> >
> >> > > >> > Hi Wes and Romain,
> >> > > >> >
> >> > > >> > I wrote a preliminary benchmark for reading and writing
> different
> >> > > file types from R into arrow, borrowed some code from Hadley. I
> would like
> >> > > some feedback to improve it and then possible push a R/benchmarks
> folder. I
> >> > > am willing to dedicate most of next week to this project, as I am
> taking a
> >> > > vacation from work, but would like to contribute to Arrow and R.
> >> > > >> >
> >> > > >> > To Romain: What is the difference in R when using tibble versus
> >> > > reading from arrow?
> >> > > >> > Is the general advantage that you can serialize the data to
> arrow
> >> > > when saving it? Then be able to call it in Python with arrow then
> pandas?
> >> > > >>
> >> > > >> Arrow has a language-independent binary protocol for data
> interchange
> >> > > >> that does not require deserialization of data on read. It can be
> read
> >> > > >> or written in many different ways: files, sockets, shared
> memory, etc.
> >> > > >> How it gets used depends on the application
> >> > > >>
> >> > > >> >
> >> > > >> > General Roadmap Question to Wes and Romain :
> >> > > >> > My vision for the future of data science, is the ability to
> serialize
> >> > > data securely and pass data and models securely with some form of
> >> > > authentication between IDEs with secure ports. This idea would
> develop with
> >> > > something similar to gRPC, with more security designed with sharing
> data. I
> >> > > noticed flight gRpc.
> >> > > >> >
> >> > > >>
> >> > > >> Correct, our plan for RPC is to use gRPC for secure transport of
> >> > > >> components of the Arrow columnar protocol. We'd love to have more
> >> > > >> developers involved with this effort.
> >> > > >>
> >> > > >> > Also, I was interested if there was any momentum in  the R
> community
> >> > > to serialize models similar to the work of Onnx into a unified model
> >> > > storage system. The idea is to have a secure reproducible
> environment for R
> >> > > and Python developer groups to readily share models and data, with
> the
> >> > > caveat that data sent also has added security and possibly a history
> >> > > associated with it for security. This piece of work, is something I
> am
> >> > > passionate in seeing come to fruition. And would like to explore
> options
> >> > > for this actualization.
> >> > > >> >
> >> > > >>
> >> > > >> Here we are focused on efficient handling and processing of
> datasets.
> >> > > >> These tools could be used to build a model storage system if so
> >> > > >> desired.
> >> > > >>
> >> > > >> > The background for me is to enable HealthCare teams to share
> medical
> >> > > data securely among different analytics teams. The security
> provisions
> >> > > would enable more robust cloud based storage and computation in a
> secure
> >> > > fashion.
> >> > > >> >
> >> > > >>
> >> > > >> I would like to see deeper integration with cloud storage
> services in
> >> > > >> 2019 in the core C++ libraries, which would be made available in
> R,
> >> > > >> Python, Ruby, etc.
> >> > > >>
> >> > > >> - Wes
> >> > > >>
> >> > > >> > Thanks,
> >> > > >> > Jonathan
> >> > > >> >
> >> > > >> >
> >> > > >> >
> >> > > >> > Side Note:
> >> > > >> > Building arrow for R on Linux was a big hassle relative to
> mac. Was
> >> > > unable to build on linux.
> >> > > >> >
> >> > > >> >
> >> > > >> >
> >> > > >> >
> >> > > >> > On Thu, Nov 15, 2018 at 7:50 PM Jonathan Chiang <
> chiang810@gmail.com>
> >> > > wrote:
> >> > > >> >>
> >> > > >> >> I'll go through that python repo and see what I can do.
> >> > > >> >>
> >> > > >> >> Thanks,
> >> > > >> >> Jonathan
> >> > > >> >>
> >> > > >> >> On Thu, Nov 15, 2018 at 1:55 PM Wes McKinney <
> wesmckinn@gmail.com>
> >> > > wrote:
> >> > > >> >>>
> >> > > >> >>> I would suggest starting an r/benchmarks directory like we
> have in
> >> > > >> >>> Python (
> >> > > https://github.com/apache/arrow/tree/master/python/benchmarks)
> >> > > >> >>> and documenting the process for running all the benchmarks.
> >> > > >> >>> On Thu, Nov 15, 2018 at 4:52 PM Romain François <
> romain@purrple.cat>
> >> > > wrote:
> >> > > >> >>> >
> >> > > >> >>> > Right now, most of the code examples is in the unit tests,
> but
> >> > > this is not measuring performance or stressing it. Perhaps you can
> start
> >> > > from there ?
> >> > > >> >>> >
> >> > > >> >>> > Romain
> >> > > >> >>> >
> >> > > >> >>> > > Le 15 nov. 2018 à 22:16, Wes McKinney <
> wesmckinn@gmail.com> a
> >> > > écrit :
> >> > > >> >>> > >
> >> > > >> >>> > > Adding dev@arrow.apache.org
> >> > > >> >>> > >> On Thu, Nov 15, 2018 at 4:13 PM Jonathan Chiang <
> >> > > chiang810@gmail.com> wrote:
> >> > > >> >>> > >>
> >> > > >> >>> > >> Hi,
> >> > > >> >>> > >>
> >> > > >> >>> > >> I would like to contribute to developing benchmark
> suites for
> >> > > R and Arrow? What would be the best way to start?
> >> > > >> >>> > >>
> >> > > >> >>> > >> Thanks,
> >> > > >> >>> > >> Jonathan
> >> > > >> >>> >
> >> > >
>

Re: Arrow and R benchmark

Posted by Wes McKinney <we...@gmail.com>.
Thanks Micah for the update.

The continued investment in Apache Avro is interesting given the
low-activity state of that community. I'm optimistic that BQ will
offer native Arrow export at some point in the future, perhaps after
we reach a "1.0.0" release.

- Wes


On Sat, Feb 23, 2019 at 12:17 PM Jonathan Chiang <ch...@gmail.com> wrote:
>
> Hi Micah,
>
> Yes I filed the feature request from your advice. I will look more into avro for my own bigquery use cases. Thanks for following up.
>
> Best,
> Jonathan
>
> On Feb 22, 2019, at 8:35 PM, Micah Kornfield <em...@gmail.com> wrote:
>
> Just to follow up on this thread, a new high throughput API [1] for reading data out of big query was released to public beta today.  The format it streams is AVRO so it should be higher performance then parsing JSON (and reads can be parallelized).  Implementing AVRO reading was something I was going to start working on in the next week or so, and I'll probably continue on to add support to arrow C++ for the new API (I will be creating JIRAs soon).  Given my current bandwidth (I contribute to arrow on my free time), this will take a while.  So if people are interested in collaborating (or taking this over) please let me know.
>
> Also, it looks like someone took my advice and filed a feature request [2] for surfacing apache arrow natively.
>
> Thanks,
> Micah
>
> [1] https://cloud.google.com/bigquery/docs/reference/storage/
> [2] https://issuetracker.google.com/issues/124858094
>
> On Wed, Feb 13, 2019 at 1:25 PM Wes McKinney <we...@gmail.com> wrote:
>>
>> Would someone like to make some feature requests to Google or engage
>> with them in another way? I have interacted with GCP in the past; I
>> think it would be helpful for them to hear from other Arrow users or
>> community members since I have been quite public as a carrier of the
>> Arrow banner.
>>
>> On Tue, Feb 5, 2019 at 12:11 AM Micah Kornfield <em...@gmail.com> wrote:
>> >
>> > Disclaimer: I work for Google (not on BQ).  Everything I'm going to write
>> > reflects my own opinions, not those of my company.
>> >
>> > Jonathan and Wes,
>> >
>> > One way of trying to get support for this is filing a feature request at
>> > [1] and getting broader customer support for it.  Another possible way of
>> > gaining broader exposure within Google is collaborating with other open
>> > source projects that it contributes to.  For instance there was a
>> > conversation recently about the potential use of Arrow on the Apache Beam
>> > mailing list [2].  I will try to post a link to this thread internally, but
>> > I can't make any promises and likely not give any updates on progress.
>> >
>> > This is also very much my own opinion, but I think in order to expose Arrow
>> > in a public API it would be nice to reach a stable major release (i.e.
>> > 1.0.0) and ensure Arrow properly supports big query data-types
>> > appropriately [3], (I think it mostly does but date/time might be an issue).
>> >
>> > [1]
>> > https://cloud.google.com/support/docs/issue-trackers#search_for_or_create_bugs_and_feature_requests_by_product
>> > [2]
>> > https://lists.apache.org/thread.html/32cbbe587016cd0ac9e1f7b1de457b0bd69936c88dfdc734ffa366db@%3Cdev.beam.apache.org%3E
>> > [3] https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
>> >
>> >
>> > On Monday, February 4, 2019, Wes McKinney <we...@gmail.com> wrote:
>> >
>> > > Arrow support would be an obvious win for BigQuery. I've spoken with
>> > > people at Google Cloud about this in several occasions.
>> > >
>> > > With the gRPC / Flight work coming along it might be a good
>> > > opportunity to rekindle the discussion. If anyone from GCP is reading
>> > > or if you know anyone at GCP who might be able to work with us I would
>> > > be very interested.
>> > >
>> > > One hurdle for BigQuery is that my understanding is that Google has
>> > > policies in place that make it more difficult to take on external
>> > > library dependencies in a sensitive system like Dremel / BigQuery. So
>> > > someone from Google might have to develop an in-house Arrow
>> > > implementation sufficient to send Arrow datasets from BigQuery to
>> > > clients. The scope of that project is small enough (requiring only
>> > > Flatbuffers as a dependency) that a motivated C or C++ developer at
>> > > Google ought to be able to get it done in a month or two of focused
>> > > work.
>> > >
>> > > - Wes
>> > >
>> > > On Mon, Feb 4, 2019 at 4:40 PM Jonathan Chiang <ch...@gmail.com>
>> > > wrote:
>> > > >
>> > > > Hi Wes,
>> > > >
>> > > > I am currently working a lot with Google BigQuery in R and Python.
>> > > Hadley Wickham listed this as a big bottleneck for his library bigrquery.
>> > > >
>> > > > The bottleneck for loading BigQuery data is now parsing BigQuery’s JSON
>> > > format, which is difficult to optimise further because I’m already using
>> > > the fastest C++ JSON parser, RapidJson. If this is still too slow (because
>> > > you download a lot of data), see ?bq_table_download for an alternative
>> > > approach.
>> > > >
>> > > > Is there any momentum for Arrow to partner with Google here?
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Jonathan
>> > > >
>> > > >
>> > > >
>> > > > On Mon, Dec 3, 2018 at 7:03 PM Wes McKinney <we...@gmail.com> wrote:
>> > > >>
>> > > >> hi Jonathan,
>> > > >> On Sat, Nov 24, 2018 at 6:19 PM Jonathan Chiang <ch...@gmail.com>
>> > > wrote:
>> > > >> >
>> > > >> > Hi Wes and Romain,
>> > > >> >
>> > > >> > I wrote a preliminary benchmark for reading and writing different
>> > > file types from R into arrow, borrowed some code from Hadley. I would like
>> > > some feedback to improve it and then possible push a R/benchmarks folder. I
>> > > am willing to dedicate most of next week to this project, as I am taking a
>> > > vacation from work, but would like to contribute to Arrow and R.
>> > > >> >
>> > > >> > To Romain: What is the difference in R when using tibble versus
>> > > reading from arrow?
>> > > >> > Is the general advantage that you can serialize the data to arrow
>> > > when saving it? Then be able to call it in Python with arrow then pandas?
>> > > >>
>> > > >> Arrow has a language-independent binary protocol for data interchange
>> > > >> that does not require deserialization of data on read. It can be read
>> > > >> or written in many different ways: files, sockets, shared memory, etc.
>> > > >> How it gets used depends on the application
>> > > >>
>> > > >> >
>> > > >> > General Roadmap Question to Wes and Romain :
>> > > >> > My vision for the future of data science, is the ability to serialize
>> > > data securely and pass data and models securely with some form of
>> > > authentication between IDEs with secure ports. This idea would develop with
>> > > something similar to gRPC, with more security designed with sharing data. I
>> > > noticed flight gRpc.
>> > > >> >
>> > > >>
>> > > >> Correct, our plan for RPC is to use gRPC for secure transport of
>> > > >> components of the Arrow columnar protocol. We'd love to have more
>> > > >> developers involved with this effort.
>> > > >>
>> > > >> > Also, I was interested if there was any momentum in  the R community
>> > > to serialize models similar to the work of Onnx into a unified model
>> > > storage system. The idea is to have a secure reproducible environment for R
>> > > and Python developer groups to readily share models and data, with the
>> > > caveat that data sent also has added security and possibly a history
>> > > associated with it for security. This piece of work, is something I am
>> > > passionate in seeing come to fruition. And would like to explore options
>> > > for this actualization.
>> > > >> >
>> > > >>
>> > > >> Here we are focused on efficient handling and processing of datasets.
>> > > >> These tools could be used to build a model storage system if so
>> > > >> desired.
>> > > >>
>> > > >> > The background for me is to enable HealthCare teams to share medical
>> > > data securely among different analytics teams. The security provisions
>> > > would enable more robust cloud based storage and computation in a secure
>> > > fashion.
>> > > >> >
>> > > >>
>> > > >> I would like to see deeper integration with cloud storage services in
>> > > >> 2019 in the core C++ libraries, which would be made available in R,
>> > > >> Python, Ruby, etc.
>> > > >>
>> > > >> - Wes
>> > > >>
>> > > >> > Thanks,
>> > > >> > Jonathan
>> > > >> >
>> > > >> >
>> > > >> >
>> > > >> > Side Note:
>> > > >> > Building arrow for R on Linux was a big hassle relative to mac. Was
>> > > unable to build on linux.
>> > > >> >
>> > > >> >
>> > > >> >
>> > > >> >
>> > > >> > On Thu, Nov 15, 2018 at 7:50 PM Jonathan Chiang <ch...@gmail.com>
>> > > wrote:
>> > > >> >>
>> > > >> >> I'll go through that python repo and see what I can do.
>> > > >> >>
>> > > >> >> Thanks,
>> > > >> >> Jonathan
>> > > >> >>
>> > > >> >> On Thu, Nov 15, 2018 at 1:55 PM Wes McKinney <we...@gmail.com>
>> > > wrote:
>> > > >> >>>
>> > > >> >>> I would suggest starting an r/benchmarks directory like we have in
>> > > >> >>> Python (
>> > > https://github.com/apache/arrow/tree/master/python/benchmarks)
>> > > >> >>> and documenting the process for running all the benchmarks.
>> > > >> >>> On Thu, Nov 15, 2018 at 4:52 PM Romain François <ro...@purrple.cat>
>> > > wrote:
>> > > >> >>> >
>> > > >> >>> > Right now, most of the code examples is in the unit tests, but
>> > > this is not measuring performance or stressing it. Perhaps you can start
>> > > from there ?
>> > > >> >>> >
>> > > >> >>> > Romain
>> > > >> >>> >
>> > > >> >>> > > Le 15 nov. 2018 à 22:16, Wes McKinney <we...@gmail.com> a
>> > > écrit :
>> > > >> >>> > >
>> > > >> >>> > > Adding dev@arrow.apache.org
>> > > >> >>> > >> On Thu, Nov 15, 2018 at 4:13 PM Jonathan Chiang <
>> > > chiang810@gmail.com> wrote:
>> > > >> >>> > >>
>> > > >> >>> > >> Hi,
>> > > >> >>> > >>
>> > > >> >>> > >> I would like to contribute to developing benchmark suites for
>> > > R and Arrow? What would be the best way to start?
>> > > >> >>> > >>
>> > > >> >>> > >> Thanks,
>> > > >> >>> > >> Jonathan
>> > > >> >>> >
>> > >

Re: Arrow and R benchmark

Posted by Jonathan Chiang <ch...@gmail.com>.
Hi Micah,

Yes, I filed the feature request based on your advice. I will look more into Avro for my own BigQuery use cases. Thanks for following up.

Best,
Jonathan 

> On Feb 22, 2019, at 8:35 PM, Micah Kornfield <em...@gmail.com> wrote:
> 
> Just to follow up on this thread, a new high throughput API [1] for reading data out of big query was released to public beta today.  The format it streams is AVRO so it should be higher performance then parsing JSON (and reads can be parallelized).  Implementing AVRO reading was something I was going to start working on in the next week or so, and I'll probably continue on to add support to arrow C++ for the new API (I will be creating JIRAs soon).  Given my current bandwidth (I contribute to arrow on my free time), this will take a while.  So if people are interested in collaborating (or taking this over) please let me know.
> 
> Also, it looks like someone took my advice and filed a feature request [2] for surfacing apache arrow natively.
> 
> Thanks,
> Micah
> 
> [1] https://cloud.google.com/bigquery/docs/reference/storage/
> [2] https://issuetracker.google.com/issues/124858094
> 
>> On Wed, Feb 13, 2019 at 1:25 PM Wes McKinney <we...@gmail.com> wrote:
>> Would someone like to make some feature requests to Google or engage
>> with them in another way? I have interacted with GCP in the past; I
>> think it would be helpful for them to hear from other Arrow users or
>> community members since I have been quite public as a carrier of the
>> Arrow banner.
>> 
>> On Tue, Feb 5, 2019 at 12:11 AM Micah Kornfield <em...@gmail.com> wrote:
>> >
>> > Disclaimer: I work for Google (not on BQ).  Everything I'm going to write
>> > reflects my own opinions, not those of my company.
>> >
>> > Jonathan and Wes,
>> >
>> > One way of trying to get support for this is filing a feature request at
>> > [1] and getting broader customer support for it.  Another possible way of
>> > gaining broader exposure within Google is collaborating with other open
>> > source projects that it contributes to.  For instance there was a
>> > conversation recently about the potential use of Arrow on the Apache Beam
>> > mailing list [2].  I will try to post a link to this thread internally, but
>> > I can't make any promises and likely not give any updates on progress.
>> >
>> > This is also very much my own opinion, but I think in order to expose Arrow
>> > in a public API it would be nice to reach a stable major release (i.e.
>> > 1.0.0) and ensure Arrow properly supports big query data-types
>> > appropriately [3], (I think it mostly does but date/time might be an issue).
>> >
>> > [1]
>> > https://cloud.google.com/support/docs/issue-trackers#search_for_or_create_bugs_and_feature_requests_by_product
>> > [2]
>> > https://lists.apache.org/thread.html/32cbbe587016cd0ac9e1f7b1de457b0bd69936c88dfdc734ffa366db@%3Cdev.beam.apache.org%3E
>> > [3] https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
>> >
>> >
>> > On Monday, February 4, 2019, Wes McKinney <we...@gmail.com> wrote:
>> >
>> > > Arrow support would be an obvious win for BigQuery. I've spoken with
>> > > people at Google Cloud about this in several occasions.
>> > >
>> > > With the gRPC / Flight work coming along it might be a good
>> > > opportunity to rekindle the discussion. If anyone from GCP is reading
>> > > or if you know anyone at GCP who might be able to work with us I would
>> > > be very interested.
>> > >
>> > > One hurdle for BigQuery is that my understanding is that Google has
>> > > policies in place that make it more difficult to take on external
>> > > library dependencies in a sensitive system like Dremel / BigQuery. So
>> > > someone from Google might have to develop an in-house Arrow
>> > > implementation sufficient to send Arrow datasets from BigQuery to
>> > > clients. The scope of that project is small enough (requiring only
>> > > Flatbuffers as a dependency) that a motivated C or C++ developer at
>> > > Google ought to be able to get it done in a month or two of focused
>> > > work.
>> > >
>> > > - Wes
>> > >
>> > > On Mon, Feb 4, 2019 at 4:40 PM Jonathan Chiang <ch...@gmail.com>
>> > > wrote:
>> > > >
>> > > > Hi Wes,
>> > > >
>> > > > I am currently working a lot with Google BigQuery in R and Python.
>> > > Hadley Wickham listed this as a big bottleneck for his library bigrquery.
>> > > >
>> > > > The bottleneck for loading BigQuery data is now parsing BigQuery’s JSON
>> > > format, which is difficult to optimise further because I’m already using
>> > > the fastest C++ JSON parser, RapidJson. If this is still too slow (because
>> > > you download a lot of data), see ?bq_table_download for an alternative
>> > > approach.
>> > > >
>> > > > Is there any momentum for Arrow to partner with Google here?
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Jonathan
>> > > >
>> > > >
>> > > >
>> > > > On Mon, Dec 3, 2018 at 7:03 PM Wes McKinney <we...@gmail.com> wrote:
>> > > >>
>> > > >> hi Jonathan,
>> > > >> On Sat, Nov 24, 2018 at 6:19 PM Jonathan Chiang <ch...@gmail.com>
>> > > wrote:
>> > > >> >
>> > > >> > Hi Wes and Romain,
>> > > >> >
>> > > >> > I wrote a preliminary benchmark for reading and writing different
>> > > file types from R into arrow, borrowed some code from Hadley. I would like
>> > > some feedback to improve it and then possible push a R/benchmarks folder. I
>> > > am willing to dedicate most of next week to this project, as I am taking a
>> > > vacation from work, but would like to contribute to Arrow and R.
>> > > >> >
>> > > >> > To Romain: What is the difference in R when using tibble versus
>> > > reading from arrow?
>> > > >> > Is the general advantage that you can serialize the data to arrow
>> > > when saving it? Then be able to call it in Python with arrow then pandas?
>> > > >>
>> > > >> Arrow has a language-independent binary protocol for data interchange
>> > > >> that does not require deserialization of data on read. It can be read
>> > > >> or written in many different ways: files, sockets, shared memory, etc.
>> > > >> How it gets used depends on the application
>> > > >>
>> > > >> >
>> > > >> > General Roadmap Question to Wes and Romain :
>> > > >> > My vision for the future of data science, is the ability to serialize
>> > > data securely and pass data and models securely with some form of
>> > > authentication between IDEs with secure ports. This idea would develop with
>> > > something similar to gRPC, with more security designed with sharing data. I
>> > > noticed flight gRpc.
>> > > >> >
>> > > >>
>> > > >> Correct, our plan for RPC is to use gRPC for secure transport of
>> > > >> components of the Arrow columnar protocol. We'd love to have more
>> > > >> developers involved with this effort.
>> > > >>
>> > > >> > Also, I was interested if there was any momentum in  the R community
>> > > to serialize models similar to the work of Onnx into a unified model
>> > > storage system. The idea is to have a secure reproducible environment for R
>> > > and Python developer groups to readily share models and data, with the
>> > > caveat that data sent also has added security and possibly a history
>> > > associated with it for security. This piece of work, is something I am
>> > > passionate in seeing come to fruition. And would like to explore options
>> > > for this actualization.
>> > > >> >
>> > > >>
>> > > >> Here we are focused on efficient handling and processing of datasets.
>> > > >> These tools could be used to build a model storage system if so
>> > > >> desired.
>> > > >>
>> > > >> > The background for me is to enable HealthCare teams to share medical
>> > > data securely among different analytics teams. The security provisions
>> > > would enable more robust cloud based storage and computation in a secure
>> > > fashion.
>> > > >> >
>> > > >>
>> > > >> I would like to see deeper integration with cloud storage services in
>> > > >> 2019 in the core C++ libraries, which would be made available in R,
>> > > >> Python, Ruby, etc.
>> > > >>
>> > > >> - Wes
>> > > >>
>> > > >> > Thanks,
>> > > >> > Jonathan
>> > > >> >
>> > > >> >
>> > > >> >
>> > > >> > Side Note:
>> > > >> > Building arrow for R on Linux was a big hassle relative to mac. Was
>> > > unable to build on linux.
>> > > >> >
>> > > >> >
>> > > >> >
>> > > >> >
>> > > >> > On Thu, Nov 15, 2018 at 7:50 PM Jonathan Chiang <ch...@gmail.com>
>> > > wrote:
>> > > >> >>
>> > > >> >> I'll go through that python repo and see what I can do.
>> > > >> >>
>> > > >> >> Thanks,
>> > > >> >> Jonathan
>> > > >> >>
>> > > >> >> On Thu, Nov 15, 2018 at 1:55 PM Wes McKinney <we...@gmail.com>
>> > > wrote:
>> > > >> >>>
>> > > >> >>> I would suggest starting an r/benchmarks directory like we have in
>> > > >> >>> Python (
>> > > https://github.com/apache/arrow/tree/master/python/benchmarks)
>> > > >> >>> and documenting the process for running all the benchmarks.
>> > > >> >>> On Thu, Nov 15, 2018 at 4:52 PM Romain François <ro...@purrple.cat>
>> > > wrote:
>> > > >> >>> >
>> > > >> >>> > Right now, most of the code examples is in the unit tests, but
>> > > this is not measuring performance or stressing it. Perhaps you can start
>> > > from there ?
>> > > >> >>> >
>> > > >> >>> > Romain
>> > > >> >>> >
>> > > >> >>> > > Le 15 nov. 2018 à 22:16, Wes McKinney <we...@gmail.com> a
>> > > écrit :
>> > > >> >>> > >
>> > > >> >>> > > Adding dev@arrow.apache.org
>> > > >> >>> > >> On Thu, Nov 15, 2018 at 4:13 PM Jonathan Chiang <
>> > > chiang810@gmail.com> wrote:
>> > > >> >>> > >>
>> > > >> >>> > >> Hi,
>> > > >> >>> > >>
>> > > >> >>> > >> I would like to contribute to developing benchmark suites for
>> > > R and Arrow? What would be the best way to start?
>> > > >> >>> > >>
>> > > >> >>> > >> Thanks,
>> > > >> >>> > >> Jonathan
>> > > >> >>> >
>> > >

Re: Arrow and R benchmark

Posted by Micah Kornfield <em...@gmail.com>.
Just to follow up on this thread, a new high-throughput API [1] for reading
data out of BigQuery was released to public beta today. The format it
streams is Avro, so it should be higher performance than parsing JSON (and
reads can be parallelized). Implementing Avro reading was something I was
going to start working on in the next week or so, and I'll probably
continue on to add support to Arrow C++ for the new API (I will be creating
JIRAs soon). Given my current bandwidth (I contribute to Arrow in my free
time), this will take a while. So if people are interested in
collaborating (or taking this over), please let me know.

Also, it looks like someone took my advice and filed a feature request [2]
for surfacing Apache Arrow natively.

Thanks,
Micah

[1] https://cloud.google.com/bigquery/docs/reference/storage/
[2] https://issuetracker.google.com/issues/124858094
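
To make the Avro-to-Arrow step concrete, here is a rough sketch of what a
reader could do in Python while native support is being built: decode the
Avro records (using the third-party fastavro package, which is my own
choice for illustration, not something the new API requires) and pivot
them into an Arrow table. The file name and columns are made up:

    import fastavro
    import pyarrow as pa

    # Decode Avro records from a local file standing in for a stream
    with open("rows.avro", "rb") as f:
        records = list(fastavro.reader(f))

    # Pivot row-oriented records into columns and build an Arrow table.
    # A real reader would map the Avro schema to Arrow types explicitly
    # and work in batches instead of materializing everything at once.
    columns = {name: [rec[name] for rec in records] for name in records[0]}
    table = pa.table(columns)
    print(table.schema)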

On Wed, Feb 13, 2019 at 1:25 PM Wes McKinney <we...@gmail.com> wrote:

> Would someone like to make some feature requests to Google or engage
> with them in another way? I have interacted with GCP in the past; I
> think it would be helpful for them to hear from other Arrow users or
> community members since I have been quite public as a carrier of the
> Arrow banner.
>
> On Tue, Feb 5, 2019 at 12:11 AM Micah Kornfield <em...@gmail.com>
> wrote:
> >
> > Disclaimer: I work for Google (not on BQ).  Everything I'm going to write
> > reflects my own opinions, not those of my company.
> >
> > Jonathan and Wes,
> >
> > One way of trying to get support for this is filing a feature request at
> > [1] and getting broader customer support for it.  Another possible way of
> > gaining broader exposure within Google is collaborating with other open
> > source projects that it contributes to.  For instance there was a
> > conversation recently about the potential use of Arrow on the Apache Beam
> > mailing list [2].  I will try to post a link to this thread internally,
> but
> > I can't make any promises and likely not give any updates on progress.
> >
> > This is also very much my own opinion, but I think in order to expose
> Arrow
> > in a public API it would be nice to reach a stable major release (i.e.
> > 1.0.0) and ensure Arrow properly supports big query data-types
> > appropriately [3], (I think it mostly does but date/time might be an
> issue).
> >
> > [1]
> >
> https://cloud.google.com/support/docs/issue-trackers#search_for_or_create_bugs_and_feature_requests_by_product
> > [2]
> >
> https://lists.apache.org/thread.html/32cbbe587016cd0ac9e1f7b1de457b0bd69936c88dfdc734ffa366db@%3Cdev.beam.apache.org%3E
> > [3]
> https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
> >
> >
> > On Monday, February 4, 2019, Wes McKinney <we...@gmail.com> wrote:
> >
> > > Arrow support would be an obvious win for BigQuery. I've spoken with
> > > people at Google Cloud about this in several occasions.
> > >
> > > With the gRPC / Flight work coming along it might be a good
> > > opportunity to rekindle the discussion. If anyone from GCP is reading
> > > or if you know anyone at GCP who might be able to work with us I would
> > > be very interested.
> > >
> > > One hurdle for BigQuery is that my understanding is that Google has
> > > policies in place that make it more difficult to take on external
> > > library dependencies in a sensitive system like Dremel / BigQuery. So
> > > someone from Google might have to develop an in-house Arrow
> > > implementation sufficient to send Arrow datasets from BigQuery to
> > > clients. The scope of that project is small enough (requiring only
> > > Flatbuffers as a dependency) that a motivated C or C++ developer at
> > > Google ought to be able to get it done in a month or two of focused
> > > work.
> > >
> > > - Wes
> > >
> > > On Mon, Feb 4, 2019 at 4:40 PM Jonathan Chiang <ch...@gmail.com>
> > > wrote:
> > > >
> > > > Hi Wes,
> > > >
> > > > I am currently working a lot with Google BigQuery in R and Python.
> > > Hadley Wickham listed this as a big bottleneck for his library
> bigrquery.
> > > >
> > > > The bottleneck for loading BigQuery data is now parsing BigQuery’s
> JSON
> > > format, which is difficult to optimise further because I’m already
> using
> > > the fastest C++ JSON parser, RapidJson. If this is still too slow
> (because
> > > you download a lot of data), see ?bq_table_download for an alternative
> > > approach.
> > > >
> > > > Is there any momentum for Arrow to partner with Google here?
> > > >
> > > > Thanks,
> > > >
> > > > Jonathan
> > > >
> > > >
> > > >
> > > > On Mon, Dec 3, 2018 at 7:03 PM Wes McKinney <we...@gmail.com>
> wrote:
> > > >>
> > > >> hi Jonathan,
> > > >> On Sat, Nov 24, 2018 at 6:19 PM Jonathan Chiang <
> chiang810@gmail.com>
> > > wrote:
> > > >> >
> > > >> > Hi Wes and Romain,
> > > >> >
> > > >> > I wrote a preliminary benchmark for reading and writing different
> > > file types from R into arrow, borrowed some code from Hadley. I would
> like
> > > some feedback to improve it and then possible push a R/benchmarks
> folder. I
> > > am willing to dedicate most of next week to this project, as I am
> taking a
> > > vacation from work, but would like to contribute to Arrow and R.
> > > >> >
> > > >> > To Romain: What is the difference in R when using tibble versus
> > > reading from arrow?
> > > >> > Is the general advantage that you can serialize the data to arrow
> > > when saving it? Then be able to call it in Python with arrow then
> pandas?
> > > >>
> > > >> Arrow has a language-independent binary protocol for data
> interchange
> > > >> that does not require deserialization of data on read. It can be
> read
> > > >> or written in many different ways: files, sockets, shared memory,
> etc.
> > > >> How it gets used depends on the application
> > > >>
> > > >> >
> > > >> > General Roadmap Question to Wes and Romain :
> > > >> > My vision for the future of data science, is the ability to
> serialize
> > > data securely and pass data and models securely with some form of
> > > authentication between IDEs with secure ports. This idea would develop
> with
> > > something similar to gRPC, with more security designed with sharing
> data. I
> > > noticed flight gRpc.
> > > >> >
> > > >>
> > > >> Correct, our plan for RPC is to use gRPC for secure transport of
> > > >> components of the Arrow columnar protocol. We'd love to have more
> > > >> developers involved with this effort.
> > > >>
> > > >> > Also, I was interested if there was any momentum in  the R
> community
> > > to serialize models similar to the work of Onnx into a unified model
> > > storage system. The idea is to have a secure reproducible environment
> for R
> > > and Python developer groups to readily share models and data, with the
> > > caveat that data sent also has added security and possibly a history
> > > associated with it for security. This piece of work, is something I am
> > > passionate in seeing come to fruition. And would like to explore
> options
> > > for this actualization.
> > > >> >
> > > >>
> > > >> Here we are focused on efficient handling and processing of
> datasets.
> > > >> These tools could be used to build a model storage system if so
> > > >> desired.
> > > >>
> > > >> > The background for me is to enable HealthCare teams to share
> medical
> > > data securely among different analytics teams. The security provisions
> > > would enable more robust cloud based storage and computation in a
> secure
> > > fashion.
> > > >> >
> > > >>
> > > >> I would like to see deeper integration with cloud storage services
> in
> > > >> 2019 in the core C++ libraries, which would be made available in R,
> > > >> Python, Ruby, etc.
> > > >>
> > > >> - Wes
> > > >>
> > > >> > Thanks,
> > > >> > Jonathan
> > > >> >
> > > >> >
> > > >> >
> > > >> > Side Note:
> > > >> > Building arrow for R on Linux was a big hassle relative to mac.
> Was
> > > unable to build on linux.
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Thu, Nov 15, 2018 at 7:50 PM Jonathan Chiang <
> chiang810@gmail.com>
> > > wrote:
> > > >> >>
> > > >> >> I'll go through that python repo and see what I can do.
> > > >> >>
> > > >> >> Thanks,
> > > >> >> Jonathan
> > > >> >>
> > > >> >> On Thu, Nov 15, 2018 at 1:55 PM Wes McKinney <
> wesmckinn@gmail.com>
> > > wrote:
> > > >> >>>
> > > >> >>> I would suggest starting an r/benchmarks directory like we have
> in
> > > >> >>> Python (
> > > https://github.com/apache/arrow/tree/master/python/benchmarks)
> > > >> >>> and documenting the process for running all the benchmarks.
> > > >> >>> On Thu, Nov 15, 2018 at 4:52 PM Romain François <
> romain@purrple.cat>
> > > wrote:
> > > >> >>> >
> > > >> >>> > Right now, most of the code examples is in the unit tests, but
> > > this is not measuring performance or stressing it. Perhaps you can
> start
> > > from there ?
> > > >> >>> >
> > > >> >>> > Romain
> > > >> >>> >
> > > >> >>> > > Le 15 nov. 2018 à 22:16, Wes McKinney <we...@gmail.com>
> a
> > > écrit :
> > > >> >>> > >
> > > >> >>> > > Adding dev@arrow.apache.org
> > > >> >>> > >> On Thu, Nov 15, 2018 at 4:13 PM Jonathan Chiang <
> > > chiang810@gmail.com> wrote:
> > > >> >>> > >>
> > > >> >>> > >> Hi,
> > > >> >>> > >>
> > > >> >>> > >> I would like to contribute to developing benchmark suites
> for
> > > R and Arrow? What would be the best way to start?
> > > >> >>> > >>
> > > >> >>> > >> Thanks,
> > > >> >>> > >> Jonathan
> > > >> >>> >
> > >
>

Re: Arrow and R benchmark

Posted by Wes McKinney <we...@gmail.com>.
Would someone like to make some feature requests to Google or engage
with them in another way? I have interacted with GCP in the past; I
think it would be helpful for them to hear from other Arrow users or
community members since I have been quite public as a carrier of the
Arrow banner.

On Tue, Feb 5, 2019 at 12:11 AM Micah Kornfield <em...@gmail.com> wrote:
>
> Disclaimer: I work for Google (not on BQ).  Everything I'm going to write
> reflects my own opinions, not those of my company.
>
> Jonathan and Wes,
>
> One way of trying to get support for this is filing a feature request at
> [1] and getting broader customer support for it.  Another possible way of
> gaining broader exposure within Google is collaborating with other open
> source projects that it contributes to.  For instance there was a
> conversation recently about the potential use of Arrow on the Apache Beam
> mailing list [2].  I will try to post a link to this thread internally, but
> I can't make any promises and likely not give any updates on progress.
>
> This is also very much my own opinion, but I think in order to expose Arrow
> in a public API it would be nice to reach a stable major release (i.e.
> 1.0.0) and ensure Arrow properly supports big query data-types
> appropriately [3], (I think it mostly does but date/time might be an issue).
>
> [1]
> https://cloud.google.com/support/docs/issue-trackers#search_for_or_create_bugs_and_feature_requests_by_product
> [2]
> https://lists.apache.org/thread.html/32cbbe587016cd0ac9e1f7b1de457b0bd69936c88dfdc734ffa366db@%3Cdev.beam.apache.org%3E
> [3] https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
>
>
> On Monday, February 4, 2019, Wes McKinney <we...@gmail.com> wrote:
>
> > Arrow support would be an obvious win for BigQuery. I've spoken with
> > people at Google Cloud about this in several occasions.
> >
> > With the gRPC / Flight work coming along it might be a good
> > opportunity to rekindle the discussion. If anyone from GCP is reading
> > or if you know anyone at GCP who might be able to work with us I would
> > be very interested.
> >
> > One hurdle for BigQuery is that my understanding is that Google has
> > policies in place that make it more difficult to take on external
> > library dependencies in a sensitive system like Dremel / BigQuery. So
> > someone from Google might have to develop an in-house Arrow
> > implementation sufficient to send Arrow datasets from BigQuery to
> > clients. The scope of that project is small enough (requiring only
> > Flatbuffers as a dependency) that a motivated C or C++ developer at
> > Google ought to be able to get it done in a month or two of focused
> > work.
> >
> > - Wes
> >
> > On Mon, Feb 4, 2019 at 4:40 PM Jonathan Chiang <ch...@gmail.com>
> > wrote:
> > >
> > > Hi Wes,
> > >
> > > I am currently working a lot with Google BigQuery in R and Python.
> > Hadley Wickham listed this as a big bottleneck for his library bigrquery.
> > >
> > > The bottleneck for loading BigQuery data is now parsing BigQuery’s JSON
> > format, which is difficult to optimise further because I’m already using
> > the fastest C++ JSON parser, RapidJson. If this is still too slow (because
> > you download a lot of data), see ?bq_table_download for an alternative
> > approach.
> > >
> > > Is there any momentum for Arrow to partner with Google here?
> > >
> > > Thanks,
> > >
> > > Jonathan
> > >
> > >
> > >
> > > On Mon, Dec 3, 2018 at 7:03 PM Wes McKinney <we...@gmail.com> wrote:
> > >>
> > >> hi Jonathan,
> > >> On Sat, Nov 24, 2018 at 6:19 PM Jonathan Chiang <ch...@gmail.com>
> > wrote:
> > >> >
> > >> > Hi Wes and Romain,
> > >> >
> > >> > I wrote a preliminary benchmark for reading and writing different
> > file types from R into arrow, borrowed some code from Hadley. I would like
> > some feedback to improve it and then possible push a R/benchmarks folder. I
> > am willing to dedicate most of next week to this project, as I am taking a
> > vacation from work, but would like to contribute to Arrow and R.
> > >> >
> > >> > To Romain: What is the difference in R when using tibble versus
> > reading from arrow?
> > >> > Is the general advantage that you can serialize the data to arrow
> > when saving it? Then be able to call it in Python with arrow then pandas?
> > >>
> > >> Arrow has a language-independent binary protocol for data interchange
> > >> that does not require deserialization of data on read. It can be read
> > >> or written in many different ways: files, sockets, shared memory, etc.
> > >> How it gets used depends on the application
> > >>
> > >> >
> > >> > General Roadmap Question to Wes and Romain :
> > >> > My vision for the future of data science, is the ability to serialize
> > data securely and pass data and models securely with some form of
> > authentication between IDEs with secure ports. This idea would develop with
> > something similar to gRPC, with more security designed with sharing data. I
> > noticed flight gRpc.
> > >> >
> > >>
> > >> Correct, our plan for RPC is to use gRPC for secure transport of
> > >> components of the Arrow columnar protocol. We'd love to have more
> > >> developers involved with this effort.
> > >>
> > >> > Also, I was interested if there was any momentum in  the R community
> > to serialize models similar to the work of Onnx into a unified model
> > storage system. The idea is to have a secure reproducible environment for R
> > and Python developer groups to readily share models and data, with the
> > caveat that data sent also has added security and possibly a history
> > associated with it for security. This piece of work, is something I am
> > passionate in seeing come to fruition. And would like to explore options
> > for this actualization.
> > >> >
> > >>
> > >> Here we are focused on efficient handling and processing of datasets.
> > >> These tools could be used to build a model storage system if so
> > >> desired.
> > >>
> > >> > The background for me is to enable HealthCare teams to share medical
> > data securely among different analytics teams. The security provisions
> > would enable more robust cloud based storage and computation in a secure
> > fashion.
> > >> >
> > >>
> > >> I would like to see deeper integration with cloud storage services in
> > >> 2019 in the core C++ libraries, which would be made available in R,
> > >> Python, Ruby, etc.
> > >>
> > >> - Wes
> > >>
> > >> > Thanks,
> > >> > Jonathan
> > >> >
> > >> >
> > >> >
> > >> > Side Note:
> > >> > Building arrow for R on Linux was a big hassle relative to mac. Was
> > unable to build on linux.
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > On Thu, Nov 15, 2018 at 7:50 PM Jonathan Chiang <ch...@gmail.com>
> > wrote:
> > >> >>
> > >> >> I'll go through that python repo and see what I can do.
> > >> >>
> > >> >> Thanks,
> > >> >> Jonathan
> > >> >>
> > >> >> On Thu, Nov 15, 2018 at 1:55 PM Wes McKinney <we...@gmail.com>
> > wrote:
> > >> >>>
> > >> >>> I would suggest starting an r/benchmarks directory like we have in
> > >> >>> Python (
> > https://github.com/apache/arrow/tree/master/python/benchmarks)
> > >> >>> and documenting the process for running all the benchmarks.
> > >> >>> On Thu, Nov 15, 2018 at 4:52 PM Romain François <ro...@purrple.cat>
> > wrote:
> > >> >>> >
> > >> >>> > Right now, most of the code examples is in the unit tests, but
> > this is not measuring performance or stressing it. Perhaps you can start
> > from there ?
> > >> >>> >
> > >> >>> > Romain
> > >> >>> >
> > >> >>> > > Le 15 nov. 2018 à 22:16, Wes McKinney <we...@gmail.com> a
> > écrit :
> > >> >>> > >
> > >> >>> > > Adding dev@arrow.apache.org
> > >> >>> > >> On Thu, Nov 15, 2018 at 4:13 PM Jonathan Chiang <
> > chiang810@gmail.com> wrote:
> > >> >>> > >>
> > >> >>> > >> Hi,
> > >> >>> > >>
> > >> >>> > >> I would like to contribute to developing benchmark suites for
> > R and Arrow? What would be the best way to start?
> > >> >>> > >>
> > >> >>> > >> Thanks,
> > >> >>> > >> Jonathan
> > >> >>> >
> >

Re: Arrow and R benchmark

Posted by Micah Kornfield <em...@gmail.com>.
Disclaimer: I work for Google (not on BQ).  Everything I'm going to write
reflects my own opinions, not those of my company.

Jonathan and Wes,

One way of trying to get support for this is filing a feature request at
[1] and getting broader customer support for it.  Another possible way of
gaining broader exposure within Google is collaborating with other open
source projects that it contributes to. For instance, there was a
conversation recently about the potential use of Arrow on the Apache Beam
mailing list [2]. I will try to post a link to this thread internally, but
I can't make any promises and likely can't give any updates on progress.

This is also very much my own opinion, but I think that in order to expose
Arrow in a public API it would be nice to reach a stable major release (i.e.
1.0.0) and ensure Arrow properly supports BigQuery data types [3] (I think
it mostly does, but date/time might be an issue).

[1]
https://cloud.google.com/support/docs/issue-trackers#search_for_or_create_bugs_and_feature_requests_by_product
[2]
https://lists.apache.org/thread.html/32cbbe587016cd0ac9e1f7b1de457b0bd69936c88dfdc734ffa366db@%3Cdev.beam.apache.org%3E
[3] https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
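
For reference, a rough sketch of how the BigQuery standard SQL types in [3]
might map onto Arrow types, written as a pyarrow schema (this mapping is my
own assumption, not an official one; TIMESTAMP, DATETIME, and TIME are the
cases most likely to need care):

    import pyarrow as pa

    # Assumed BigQuery -> Arrow type mapping, one column per BigQuery type
    bq_to_arrow = pa.schema([
        ("string_col",    pa.string()),                  # STRING
        ("bytes_col",     pa.binary()),                  # BYTES
        ("int_col",       pa.int64()),                   # INT64
        ("float_col",     pa.float64()),                 # FLOAT64
        ("bool_col",      pa.bool_()),                   # BOOL
        ("numeric_col",   pa.decimal128(38, 9)),         # NUMERIC
        ("timestamp_col", pa.timestamp("us", tz="UTC")), # TIMESTAMP
        ("date_col",      pa.date32()),                  # DATE
        ("time_col",      pa.time64("us")),              # TIME
    ])
    print(bq_to_arrow)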


On Monday, February 4, 2019, Wes McKinney <we...@gmail.com> wrote:

> Arrow support would be an obvious win for BigQuery. I've spoken with
> people at Google Cloud about this in several occasions.
>
> With the gRPC / Flight work coming along it might be a good
> opportunity to rekindle the discussion. If anyone from GCP is reading
> or if you know anyone at GCP who might be able to work with us I would
> be very interested.
>
> One hurdle for BigQuery is that my understanding is that Google has
> policies in place that make it more difficult to take on external
> library dependencies in a sensitive system like Dremel / BigQuery. So
> someone from Google might have to develop an in-house Arrow
> implementation sufficient to send Arrow datasets from BigQuery to
> clients. The scope of that project is small enough (requiring only
> Flatbuffers as a dependency) that a motivated C or C++ developer at
> Google ought to be able to get it done in a month or two of focused
> work.
>
> - Wes
>
> On Mon, Feb 4, 2019 at 4:40 PM Jonathan Chiang <ch...@gmail.com>
> wrote:
> >
> > Hi Wes,
> >
> > I am currently working a lot with Google BigQuery in R and Python.
> Hadley Wickham listed this as a big bottleneck for his library bigrquery.
> >
> > The bottleneck for loading BigQuery data is now parsing BigQuery’s JSON
> format, which is difficult to optimise further because I’m already using
> the fastest C++ JSON parser, RapidJson. If this is still too slow (because
> you download a lot of data), see ?bq_table_download for an alternative
> approach.
> >
> > Is there any momentum for Arrow to partner with Google here?
> >
> > Thanks,
> >
> > Jonathan
> >
> >
> >
> > On Mon, Dec 3, 2018 at 7:03 PM Wes McKinney <we...@gmail.com> wrote:
> >>
> >> hi Jonathan,
> >> On Sat, Nov 24, 2018 at 6:19 PM Jonathan Chiang <ch...@gmail.com>
> wrote:
> >> >
> >> > Hi Wes and Romain,
> >> >
> >> > I wrote a preliminary benchmark for reading and writing different
> file types from R into arrow, borrowed some code from Hadley. I would like
> some feedback to improve it and then possible push a R/benchmarks folder. I
> am willing to dedicate most of next week to this project, as I am taking a
> vacation from work, but would like to contribute to Arrow and R.
> >> >
> >> > To Romain: What is the difference in R when using tibble versus
> reading from arrow?
> >> > Is the general advantage that you can serialize the data to arrow
> when saving it? Then be able to call it in Python with arrow then pandas?
> >>
> >> Arrow has a language-independent binary protocol for data interchange
> >> that does not require deserialization of data on read. It can be read
> >> or written in many different ways: files, sockets, shared memory, etc.
> >> How it gets used depends on the application
> >>
> >> >
> >> > General Roadmap Question to Wes and Romain :
> >> > My vision for the future of data science, is the ability to serialize
> data securely and pass data and models securely with some form of
> authentication between IDEs with secure ports. This idea would develop with
> something similar to gRPC, with more security designed with sharing data. I
> noticed flight gRpc.
> >> >
> >>
> >> Correct, our plan for RPC is to use gRPC for secure transport of
> >> components of the Arrow columnar protocol. We'd love to have more
> >> developers involved with this effort.
> >>
> >> > Also, I was interested if there was any momentum in  the R community
> to serialize models similar to the work of Onnx into a unified model
> storage system. The idea is to have a secure reproducible environment for R
> and Python developer groups to readily share models and data, with the
> caveat that data sent also has added security and possibly a history
> associated with it for security. This piece of work, is something I am
> passionate in seeing come to fruition. And would like to explore options
> for this actualization.
> >> >
> >>
> >> Here we are focused on efficient handling and processing of datasets.
> >> These tools could be used to build a model storage system if so
> >> desired.
> >>
> >> > The background for me is to enable HealthCare teams to share medical
> data securely among different analytics teams. The security provisions
> would enable more robust cloud based storage and computation in a secure
> fashion.
> >> >
> >>
> >> I would like to see deeper integration with cloud storage services in
> >> 2019 in the core C++ libraries, which would be made available in R,
> >> Python, Ruby, etc.
> >>
> >> - Wes
> >>
> >> > Thanks,
> >> > Jonathan
> >> >
> >> >
> >> >
> >> > Side Note:
> >> > Building arrow for R on Linux was a big hassle relative to mac. Was
> unable to build on linux.
> >> >
> >> >
> >> >
> >> >
> >> > On Thu, Nov 15, 2018 at 7:50 PM Jonathan Chiang <ch...@gmail.com>
> wrote:
> >> >>
> >> >> I'll go through that python repo and see what I can do.
> >> >>
> >> >> Thanks,
> >> >> Jonathan
> >> >>
> >> >> On Thu, Nov 15, 2018 at 1:55 PM Wes McKinney <we...@gmail.com>
> wrote:
> >> >>>
> >> >>> I would suggest starting an r/benchmarks directory like we have in
> >> >>> Python (
> https://github.com/apache/arrow/tree/master/python/benchmarks)
> >> >>> and documenting the process for running all the benchmarks.
> >> >>> On Thu, Nov 15, 2018 at 4:52 PM Romain François <ro...@purrple.cat>
> wrote:
> >> >>> >
> >> >>> > Right now, most of the code examples is in the unit tests, but
> this is not measuring performance or stressing it. Perhaps you can start
> from there ?
> >> >>> >
> >> >>> > Romain
> >> >>> >
> >> >>> > > Le 15 nov. 2018 à 22:16, Wes McKinney <we...@gmail.com> a
> écrit :
> >> >>> > >
> >> >>> > > Adding dev@arrow.apache.org
> >> >>> > >> On Thu, Nov 15, 2018 at 4:13 PM Jonathan Chiang <
> chiang810@gmail.com> wrote:
> >> >>> > >>
> >> >>> > >> Hi,
> >> >>> > >>
> >> >>> > >> I would like to contribute to developing benchmark suites for
> R and Arrow? What would be the best way to start?
> >> >>> > >>
> >> >>> > >> Thanks,
> >> >>> > >> Jonathan
> >> >>> >
>

Re: Arrow and R benchmark

Posted by Wes McKinney <we...@gmail.com>.
Arrow support would be an obvious win for BigQuery. I've spoken with
people at Google Cloud about this on several occasions.

With the gRPC / Flight work coming along it might be a good
opportunity to rekindle the discussion. If anyone from GCP is reading
or if you know anyone at GCP who might be able to work with us I would
be very interested.

One hurdle for BigQuery is that, as I understand it, Google has
policies in place that make it more difficult to take on external
library dependencies in a sensitive system like Dremel / BigQuery. So
someone from Google might have to develop an in-house Arrow
implementation sufficient to send Arrow datasets from BigQuery to
clients. The scope of that project is small enough (requiring only
Flatbuffers as a dependency) that a motivated C or C++ developer at
Google ought to be able to get it done in a month or two of focused
work.

- Wes
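
For a sense of what sending Arrow datasets from a service like BigQuery to
clients over Flight could look like, here is a rough client-side sketch
using the pyarrow.flight API. The endpoint and ticket contents are
hypothetical; no actual BigQuery Flight service is implied:

    import pyarrow.flight as flight

    # Hypothetical Flight endpoint -- this only illustrates the shape of
    # a Flight-based transfer, not an existing service.
    client = flight.connect("grpc+tls://bigquery-flight.example.com:443")

    # A real service would define its own ticket format (e.g. a serialized
    # query or table reference); these bytes are purely illustrative.
    ticket = flight.Ticket(b"SELECT * FROM `project.dataset.table` LIMIT 1000")

    # do_get streams Arrow record batches; read_all() collects them into a
    # Table on the client with no per-row parsing or conversion.
    reader = client.do_get(ticket)
    table = reader.read_all()
    print(table.num_rows, table.schema)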

On Mon, Feb 4, 2019 at 4:40 PM Jonathan Chiang <ch...@gmail.com> wrote:
>
> Hi Wes,
>
> I am currently working a lot with Google BigQuery in R and Python. Hadley Wickham listed this as a big bottleneck for his library bigrquery.
>
> The bottleneck for loading BigQuery data is now parsing BigQuery’s JSON format, which is difficult to optimise further because I’m already using the fastest C++ JSON parser, RapidJson. If this is still too slow (because you download a lot of data), see ?bq_table_download for an alternative approach.
>
> Is there any momentum for Arrow to partner with Google here?
>
> Thanks,
>
> Jonathan
>
>
>
> On Mon, Dec 3, 2018 at 7:03 PM Wes McKinney <we...@gmail.com> wrote:
>>
>> hi Jonathan,
>> On Sat, Nov 24, 2018 at 6:19 PM Jonathan Chiang <ch...@gmail.com> wrote:
>> >
>> > Hi Wes and Romain,
>> >
>> > I wrote a preliminary benchmark for reading and writing different file types from R into arrow, borrowed some code from Hadley. I would like some feedback to improve it and then possible push a R/benchmarks folder. I am willing to dedicate most of next week to this project, as I am taking a vacation from work, but would like to contribute to Arrow and R.
>> >
>> > To Romain: What is the difference in R when using tibble versus reading from arrow?
>> > Is the general advantage that you can serialize the data to arrow when saving it? Then be able to call it in Python with arrow then pandas?
>>
>> Arrow has a language-independent binary protocol for data interchange
>> that does not require deserialization of data on read. It can be read
>> or written in many different ways: files, sockets, shared memory, etc.
>> How it gets used depends on the application
>>
>> >
>> > General Roadmap Question to Wes and Romain :
>> > My vision for the future of data science, is the ability to serialize data securely and pass data and models securely with some form of authentication between IDEs with secure ports. This idea would develop with something similar to gRPC, with more security designed with sharing data. I noticed flight gRpc.
>> >
>>
>> Correct, our plan for RPC is to use gRPC for secure transport of
>> components of the Arrow columnar protocol. We'd love to have more
>> developers involved with this effort.
>>
>> > Also, I was interested if there was any momentum in  the R community to serialize models similar to the work of Onnx into a unified model storage system. The idea is to have a secure reproducible environment for R and Python developer groups to readily share models and data, with the caveat that data sent also has added security and possibly a history associated with it for security. This piece of work, is something I am passionate in seeing come to fruition. And would like to explore options for this actualization.
>> >
>>
>> Here we are focused on efficient handling and processing of datasets.
>> These tools could be used to build a model storage system if so
>> desired.
>>
>> > The background for me is to enable HealthCare teams to share medical data securely among different analytics teams. The security provisions would enable more robust cloud based storage and computation in a secure fashion.
>> >
>>
>> I would like to see deeper integration with cloud storage services in
>> 2019 in the core C++ libraries, which would be made available in R,
>> Python, Ruby, etc.
>>
>> - Wes
>>
>> > Thanks,
>> > Jonathan
>> >
>> >
>> >
>> > Side Note:
>> > Building arrow for R on Linux was a big hassle relative to mac. Was unable to build on linux.
>> >
>> >
>> >
>> >
>> > On Thu, Nov 15, 2018 at 7:50 PM Jonathan Chiang <ch...@gmail.com> wrote:
>> >>
>> >> I'll go through that python repo and see what I can do.
>> >>
>> >> Thanks,
>> >> Jonathan
>> >>
>> >> On Thu, Nov 15, 2018 at 1:55 PM Wes McKinney <we...@gmail.com> wrote:
>> >>>
>> >>> I would suggest starting an r/benchmarks directory like we have in
>> >>> Python (https://github.com/apache/arrow/tree/master/python/benchmarks)
>> >>> and documenting the process for running all the benchmarks.
>> >>> On Thu, Nov 15, 2018 at 4:52 PM Romain François <ro...@purrple.cat> wrote:
>> >>> >
>> >>> > Right now, most of the code examples is in the unit tests, but this is not measuring performance or stressing it. Perhaps you can start from there ?
>> >>> >
>> >>> > Romain
>> >>> >
>> >>> > > Le 15 nov. 2018 à 22:16, Wes McKinney <we...@gmail.com> a écrit :
>> >>> > >
>> >>> > > Adding dev@arrow.apache.org
>> >>> > >> On Thu, Nov 15, 2018 at 4:13 PM Jonathan Chiang <ch...@gmail.com> wrote:
>> >>> > >>
>> >>> > >> Hi,
>> >>> > >>
>> >>> > >> I would like to contribute to developing benchmark suites for R and Arrow? What would be the best way to start?
>> >>> > >>
>> >>> > >> Thanks,
>> >>> > >> Jonathan
>> >>> >

Re: Arrow and R benchmark

Posted by Jonathan Chiang <ch...@gmail.com>.
Hi Wes,

I am currently working a lot with Google BigQuery in R and Python. Hadley
Wickham lists the following as a big bottleneck for his bigrquery library:

*The bottleneck for loading BigQuery data is now parsing BigQuery’s JSON
format, which is difficult to optimise further because I’m already using
the fastest C++ JSON parser, RapidJson <http://rapidjson.org/>. If this is
still too slow (because you download a lot of data),
see ?bq_table_download for an alternative approach.*

Is there any momentum for Arrow to partner with Google here?

Thanks,

Jonathan
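
To illustrate why an Arrow-based transfer path sidesteps the JSON-parsing
bottleneck described above, here is a small sketch in Python with pyarrow:
the JSON path has to parse and convert every cell, while the Arrow IPC path
maps the received bytes straight into a table. The data and sizes are made
up for illustration:

    import json
    import pandas as pd
    import pyarrow as pa
    import pyarrow.ipc as ipc

    df = pd.DataFrame({"user_id": range(1_000_000),
                       "score": [0.5] * 1_000_000})

    # JSON-style path: every value is re-parsed and converted row by row.
    lines = df.to_json(orient="records", lines=True).splitlines()
    parsed = pd.DataFrame([json.loads(line) for line in lines])

    # Arrow IPC path: the wire layout is the in-memory layout, so reading
    # back is mostly a matter of mapping buffers, not parsing text.
    table = pa.Table.from_pandas(df)
    sink = pa.BufferOutputStream()
    with ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    restored = ipc.open_stream(sink.getvalue()).read_all().to_pandas()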



On Mon, Dec 3, 2018 at 7:03 PM Wes McKinney <we...@gmail.com> wrote:

> hi Jonathan,
> On Sat, Nov 24, 2018 at 6:19 PM Jonathan Chiang <ch...@gmail.com>
> wrote:
> >
> > Hi Wes and Romain,
> >
> > I wrote a preliminary benchmark for reading and writing different file
> types from R into arrow, borrowed some code from Hadley. I would like some
> feedback to improve it and then possible push a R/benchmarks folder. I am
> willing to dedicate most of next week to this project, as I am taking a
> vacation from work, but would like to contribute to Arrow and R.
> >
> > To Romain: What is the difference in R when using tibble versus reading
> from arrow?
> > Is the general advantage that you can serialize the data to arrow when
> saving it? Then be able to call it in Python with arrow then pandas?
>
> Arrow has a language-independent binary protocol for data interchange
> that does not require deserialization of data on read. It can be read
> or written in many different ways: files, sockets, shared memory, etc.
> How it gets used depends on the application
>
> >
> > General Roadmap Question to Wes and Romain :
> > My vision for the future of data science, is the ability to serialize
> data securely and pass data and models securely with some form of
> authentication between IDEs with secure ports. This idea would develop with
> something similar to gRPC, with more security designed with sharing data. I
> noticed flight gRpc.
> >
>
> Correct, our plan for RPC is to use gRPC for secure transport of
> components of the Arrow columnar protocol. We'd love to have more
> developers involved with this effort.
>
> > Also, I was interested if there was any momentum in  the R community to
> serialize models similar to the work of Onnx into a unified model storage
> system. The idea is to have a secure reproducible environment for R and
> Python developer groups to readily share models and data, with the caveat
> that data sent also has added security and possibly a history associated
> with it for security. This piece of work, is something I am passionate in
> seeing come to fruition. And would like to explore options for this
> actualization.
> >
>
> Here we are focused on efficient handling and processing of datasets.
> These tools could be used to build a model storage system if so
> desired.
>
> > The background for me is to enable HealthCare teams to share medical
> data securely among different analytics teams. The security provisions
> would enable more robust cloud based storage and computation in a secure
> fashion.
> >
>
> I would like to see deeper integration with cloud storage services in
> 2019 in the core C++ libraries, which would be made available in R,
> Python, Ruby, etc.
>
> - Wes
>
> > Thanks,
> > Jonathan
> >
> >
> >
> > Side Note:
> > Building arrow for R on Linux was a big hassle relative to mac. Was
> unable to build on linux.
> >
> >
> >
> >
> > On Thu, Nov 15, 2018 at 7:50 PM Jonathan Chiang <ch...@gmail.com>
> wrote:
> >>
> >> I'll go through that python repo and see what I can do.
> >>
> >> Thanks,
> >> Jonathan
> >>
> >> On Thu, Nov 15, 2018 at 1:55 PM Wes McKinney <we...@gmail.com>
> wrote:
> >>>
> >>> I would suggest starting an r/benchmarks directory like we have in
> >>> Python (https://github.com/apache/arrow/tree/master/python/benchmarks)
> >>> and documenting the process for running all the benchmarks.
> >>> On Thu, Nov 15, 2018 at 4:52 PM Romain François <ro...@purrple.cat>
> wrote:
> >>> >
> >>> > Right now, most of the code examples is in the unit tests, but this
> is not measuring performance or stressing it. Perhaps you can start from
> there ?
> >>> >
> >>> > Romain
> >>> >
> >>> > > Le 15 nov. 2018 à 22:16, Wes McKinney <we...@gmail.com> a
> écrit :
> >>> > >
> >>> > > Adding dev@arrow.apache.org
> >>> > >> On Thu, Nov 15, 2018 at 4:13 PM Jonathan Chiang <
> chiang810@gmail.com> wrote:
> >>> > >>
> >>> > >> Hi,
> >>> > >>
> >>> > >> I would like to contribute to developing benchmark suites for R
> and Arrow? What would be the best way to start?
> >>> > >>
> >>> > >> Thanks,
> >>> > >> Jonathan
> >>> >
>
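
To make the benchmark-suite suggestion in the quoted thread concrete, here
is a minimal timing sketch of the kind of read comparison that could seed
such a directory. It is written in Python with pandas and pyarrow for
brevity (an R version built on bench::mark would take the same shape), and
the file names and data sizes are illustrative only:

    import time
    import pandas as pd
    import pyarrow.csv
    import pyarrow.feather as feather

    # Synthetic input; the column mix and row count are illustrative only.
    n = 1_000_000
    df = pd.DataFrame({"id": range(n),
                       "value": [1.5] * n,
                       "label": ["benchmark"] * n})
    df.to_csv("bench.csv", index=False)
    feather.write_feather(df, "bench.feather")

    def time_one(fn):
        start = time.perf_counter()
        fn()
        return time.perf_counter() - start

    def timed(label, fn, repeats=3):
        best = min(time_one(fn) for _ in range(repeats))
        print(f"{label:>22s}: {best:.3f}s")

    timed("pandas.read_csv", lambda: pd.read_csv("bench.csv"))
    timed("pyarrow.csv.read_csv", lambda: pyarrow.csv.read_csv("bench.csv"))
    timed("feather.read_table", lambda: feather.read_table("bench.feather"))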