Posted to dev@arrow.apache.org by Jorge Cardoso Leitão <jo...@gmail.com> on 2021/06/06 06:46:40 UTC

Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

Hi,

Thanks a lot for your feedback. I agree with all the arguments put forward,
including Andrew's point about the large change.

I tried a gradual migration 4 months ago, but it was really difficult and
I gave up. I estimate that the work involved is half the work of writing
parquet2 and arrow2 in the first place. The internal dependency on
ArrayData (the main source of the unsafe code) in arrow-rs is so prevalent
that all core components would need to be rewritten from scratch (IPC,
FFI, IO, array/transform/*, compute, SIMD). I personally do not have the
motivation to do it, though.
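
As a rough sketch of why, with schematic types rather than the actual APIs
of either crate: with an untyped buffer container every typed view is an
unchecked cast the compiler cannot verify, whereas a typed container needs
no unsafe at all:

// Schematic contrast only; neither struct is the real API.

// arrow-rs style: the element type is runtime data over untyped bytes,
// so viewing typed values requires unsafe casts whose invariants
// (length, alignment, offsets) every caller must uphold by hand.
#[allow(dead_code)]
struct UntypedArray {
    data_type: &'static str, // e.g. "Int32"; known only at runtime
    buffer: Vec<u8>,         // raw bytes; validity is trusted, not proven
}

// arrow2 style: the element type is a compile-time parameter, so the
// typed view is checked by the compiler and needs no unsafe code.
struct TypedArray<T> {
    values: Vec<T>,
}

impl<T> TypedArray<T> {
    fn values(&self) -> &[T] {
        &self.values
    }
}

fn main() {
    let a = TypedArray { values: vec![1i32, 2, 3] };
    println!("{:?}", a.values());
}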

Jed, the public API changes are small for end users; a typical migration
is shown in [1]. I agree that we can further reduce the change-set by
keeping legacy interfaces available.

Andy, on my machine, the current benchmarks on query 1 yield:

type                                          master (ms)   PR [2] arrow2+parquet2 (ms)
memory (-m)                                        332.9          239.6
load (initial time in -m, --format parquet)       5286.0         3043.0
parquet format                                    1316.1          930.7
tbl format                                        5297.3         5383.1

i.e. I am observing some improvements. Queries with joins are still
slower. The pruning of parquet row groups and pages based on stats is not
yet there; I am working on it.

I agree that this should go through IP clearance, and I will start that
process. My thinking would be to create two empty repos under apache/*,
open two PRs from the main branches of my repos into them, and only merge
once IP is cleared. Would that be a reasonable process, Wes?

Names: arrow-experimental-rs2 and arrow-experimental-rs-parquet2, or
something else?

Best,
Jorge

[1]
https://github.com/apache/arrow-datafusion/pull/68/files#diff-2ec0d66fd16c73ff72a23d40186944591e040507c731228ad70b4e168e2a4660
[2] https://github.com/apache/arrow-datafusion/pull/68


On Fri, May 28, 2021 at 5:22 AM Josh Taylor <jo...@gmail.com> wrote:

> I played around with it. For my use case I really like the new way of
> writing CSVs; it's much more obvious. I love the `read_stream_metadata`
> function as well.
>
> I'm seeing a very slight speed improvement (~8ms) on my end, but I read a
> bunch of files in a directory and spit out a CSV; the bottleneck is
> parsing lots of files, though it's pretty quick per file.
>
> old:
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_0 120224
> bytes took 1ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_1 123144
> bytes took 1ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_10
> 17127928 bytes took 159ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_11
> 17127144 bytes took 160ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_12
> 17130352 bytes took 158ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_13
> 17128544 bytes took 158ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_14
> 17128664 bytes took 158ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_15
> 17128328 bytes took 158ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_16
> 17129288 bytes took 158ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_17
> 17131056 bytes took 158ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_18
> 17130344 bytes took 158ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_19
> 17128432 bytes took 160ms
>
> new:
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_0 120224
> bytes took 1ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_1 123144
> bytes took 1ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_10
> 17127928 bytes took 157ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_11
> 17127144 bytes took 152ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_12
> 17130352 bytes took 154ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_13
> 17128544 bytes took 153ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_14
> 17128664 bytes took 154ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_15
> 17128328 bytes took 153ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_16
> 17129288 bytes took 152ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_17
> 17131056 bytes took 153ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_18
> 17130344 bytes took 155ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_19
> 17128432 bytes took 153ms
>
> I'm going to chunk the dirs to speed up the reads and throw it into a par
> iter.
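
A minimal sketch of that plan, assuming the rayon crate (convert_file is a
hypothetical stand-in for the existing per-file Arrow-to-CSV step):

use rayon::prelude::*;
use std::{fs, io, path::{Path, PathBuf}};

// Hypothetical stand-in for the existing per-file Arrow-to-CSV step.
fn convert_file(_path: &Path) -> io::Result<()> {
    Ok(())
}

fn main() -> io::Result<()> {
    // Collect the directory entries up front, then fan the per-file work
    // out over rayon's thread pool with a parallel iterator.
    let paths: Vec<PathBuf> = fs::read_dir("/home/josh/staging")?
        .filter_map(|entry| entry.ok())
        .map(|entry| entry.path())
        .collect();

    paths.par_iter().for_each(|path| {
        if let Err(e) = convert_file(path) {
            eprintln!("{}: {}", path.display(), e);
        }
    });
    Ok(())
}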
>
> On Fri, 28 May 2021 at 09:09, Josh Taylor <jo...@gmail.com> wrote:
>
> > Hi!
> >
> > I've been using arrow/arrow-rs for a while now; my use case is to parse
> > Arrow streaming files and convert them into CSV.
> >
> > Rust has been an absolutely fantastic tool for this; the performance is
> > outstanding and I have had no issues using it for my use case.
> >
> > I would be happy to test out the branch and let you know what the
> > performance is like, as I was going to improve the current implementation
> > that I have for the CSV writer, since it takes a while for bigger datasets
> > (multi-GB).
> >
> > Josh
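
For reference, a minimal sketch of that pipeline against the arrow crate
API of this era (arrow ~4.x; arrow2 exposes an analogous reader built
around `read_stream_metadata`):

use std::fs::File;
use arrow::csv::Writer;
use arrow::error::Result;
use arrow::ipc::reader::StreamReader;

// Read an Arrow IPC stream file and re-encode every record batch as CSV.
fn stream_to_csv(input: &str, output: &str) -> Result<()> {
    let reader = StreamReader::try_new(File::open(input)?)?;
    let mut writer = Writer::new(File::create(output)?);
    for batch in reader {
        writer.write(&batch?)?;
    }
    Ok(())
}

fn main() -> Result<()> {
    stream_to_csv("data_0_0_0", "data_0_0_0.csv")
}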
> >
> >
> > On Thu, 27 May 2021 at 22:49, Jed Brown <je...@jedbrown.org> wrote:
> >
> >> Andy Grove <an...@gmail.com> writes:
> >> >
> >> > Looking at this purely from the DataFusion/Ballista point of view, what
> >> > I would be interested in would be having a branch of DF that uses
> >> > arrow2, and once that branch has all tests passing and can run queries
> >> > with performance that is at least as good as the original arrow crate,
> >> > then cut over.
> >> >
> >> > However, for developers using the arrow APIs directly, I don't see an
> >> > easy path. We either try to gradually PR the changes in (which seems
> >> > really hard given that there are significant changes to APIs and
> >> > internal data structures) or we port some portion of the existing tests
> >> > over to arrow2 and then make that the official crate once all tests
> >> > pass.
> >>
> >> How feasible would it be to make a legacy module in arrow2 that would
> >> enable (some large subset of) existing arrow users to try arrow2 after
> >> adjusting their use statements? (That is, implement the public-facing
> >> legacy interfaces in terms of arrow2's new, safe interface.) This would
> >> make it easier to test with DataFusion/Ballista and external users of the
> >> current arrow crate, then cut over and let those packages update
> >> incrementally from legacy to modern arrow2.
> >>
> >> I think it would be okay to tolerate some performance degradation when
> >> working through these legacy interfaces, so long as there was confidence
> >> that modernizing the callers would recover the performance (as tests have
> >> been showing).
> >>
> >
>
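
A sketch of the legacy-module idea above, with illustrative stand-in types
rather than the real APIs of either crate: the old public surface stays
source-compatible while delegating to the new, safe implementation.

// Schematic sketch of a `legacy` module: old public signatures kept
// intact, implemented on top of the new, safe core.
mod modern {
    // Stand-in for arrow2's safe, typed implementation.
    pub struct Int32Array(Vec<Option<i32>>);

    impl Int32Array {
        pub fn from_iter(values: impl IntoIterator<Item = Option<i32>>) -> Self {
            Int32Array(values.into_iter().collect())
        }
        pub fn get(&self, i: usize) -> Option<i32> {
            self.0.get(i).copied().flatten()
        }
    }
}

mod legacy {
    use super::modern;

    // Old-style names preserved so existing callers only need to adjust
    // their `use` statements, at some cost in indirection.
    pub struct Int32Array(modern::Int32Array);

    impl Int32Array {
        pub fn from(values: Vec<Option<i32>>) -> Self {
            Int32Array(modern::Int32Array::from_iter(values))
        }
        pub fn value(&self, i: usize) -> i32 {
            self.0.get(i).expect("null or out-of-bounds value")
        }
    }
}

fn main() {
    let a = legacy::Int32Array::from(vec![Some(1), None, Some(3)]);
    println!("{}", a.value(2)); // prints 3
}

Callers would then change `use arrow::...` to the legacy module's path and
migrate to the modern API incrementally, accepting some overhead in the
shim as Jed suggests.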

Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.
Awesome. Thanks Wes.

I have now initiated the vote for both projects.

Best,
Jorge

Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

Posted by Wes McKinney <we...@gmail.com>.
The process for updating the website is described on

https://incubator.apache.org/guides/website.html

It looks like you need to add the new entries to the index.xml file
and then trigger a website build (which should be triggered by changes
to SVN, but if not you can trigger one manually through Jenkins).

After the new IP clearance pages are visible, you should send a lazy
consensus IP clearance vote to general@incubator.apache.org, like the one
at

https://lists.apache.org/thread.html/r319b85f0f24f9b0529865387ccfe1b2a00a16f394a48144ba25c3225%40%3Cgeneral.incubator.apache.org%3E

Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.
Thanks a lot Wes,

I am not sure how to proceed from here:

1. How do we generate the HTML from the XML, i.e. a page like
https://incubator.apache.org/ip-clearance/arrow-rust-ballista.html?
2. How do I trigger the process to start? Can I just email the
incubator with the proposal?

Best,
Jorge

Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

Posted by Wes McKinney <we...@gmail.com>.
Great, thanks for the update and pushing this forward. Let us know if
you need help with anything.

Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.
Hi,

Wes and Neils,

Thank you for your feedback and offer. I have created the two .xml reports:

http://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/arrow-rust-experimental-arrow.xml
http://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/arrow-rust-experimental-parquet.xml

I based them on the report for Ballista. I also requested, on the PRs
[1,2], clarification with regard to each contributor's contributions.

Best,
Jorge

[1] https://github.com/apache/arrow-experimental-rs-arrow2/pull/1
[2] https://github.com/apache/arrow-experimental-rs-parquet2/pull/1



On Mon, Jun 7, 2021 at 11:55 PM Wes McKinney <we...@gmail.com> wrote:

> On Sun, Jun 6, 2021 at 1:47 AM Jorge Cardoso Leitão
> <jo...@gmail.com> wrote:
> >
> > Hi,
> >
> > Thanks a lot for your feedback. I agree with all the arguments put
> forward,
> > including Andrew's point about the large change.
> >
> > I tried a gradual 4 months ago, but it was really difficult and I gave
> up.
> > I estimate that the work involved is half the work of writing parquet2
> and
> > arrow2 in the first place. The internal dependency on ArrayData (the main
> > culprit of the unsafe) on arrow-rs is so prevalent that all core
> components
> > need to be re-written from scratch (IPC, FFI, IO, array/transform/*,
> > compute, SIMD). I personally do not have the motivation to do it, though.
> >
> > Jed, the public API changes are small for end users. A typical migration
> is
> > [1]. I agree that we can further reduce the change-set by keeping legacy
> > interfaces available.
> >
> > Andy, on my machine, the current benchmarks on query 1 yield:
> >
> > type, master (ms), PR [2] for arrow2+parquet2 (ms)
> > memory (-m): 332.9, 239.6
> > load (the initial time in -m with --format parquet): 5286.0, 3043.0
> > parquet format: 1316.1, 930.7
> > tbl format: 5297.3, 5383.1
> >
> > i.e. I am observing some improvements. Queries with joins are still
> slower.
> > The pruning of parquet groups and pages based on stats are not yet
> there; I
> > am working on them.
> >
> > I agree that this should go through IP clearance. I will start this
> > process. My thinking would be to create two empty repos on apache/*, and
> > create 2 PRs from the main branches of each of my repos to those repos,
> and
> > only merge them once IP is cleared. Would that be a reasonable process,
> Wes?
>
> This sounds plenty fine to me — I'm happy to assist with the IP
> clearance process having done it several times in the past. I don't
> have an opinion about the names, but having experimental- in the name
> sounds in line with the previous discussion we had about this.
>
> > Names: arrow-experimental-rs2 and arrow-experimental-rs-parquet2, or?
> >
> > Best,
> > Jorge
> >
> > [1]
> >
> https://github.com/apache/arrow-datafusion/pull/68/files#diff-2ec0d66fd16c73ff72a23d40186944591e040507c731228ad70b4e168e2a4660
> > [2] https://github.com/apache/arrow-datafusion/pull/68
> >
> >
> > On Fri, May 28, 2021 at 5:22 AM Josh Taylor <jo...@gmail.com>
> wrote:
> >
> > > I played around with it, for my use case I really like the new way of
> > > writing CSVs, it's much more obvious. I love the `read_stream_metadata`
> > > function as well.
> > >
> > > I'm seeing a very slight speed (~8ms) improvement on my end, but I
> read a
> > > bunch of files in a directory and spit out a CSV, the bottleneck is the
> > > parsing of lots of files, but it's pretty quick per file.
> > >
> > > old:
> > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_0
> 120224
> > > bytes took 1ms
> > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_1
> 123144
> > > bytes took 1ms
> > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_10
> > > 17127928 bytes took 159ms
> > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_11
> > > 17127144 bytes took 160ms
> > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_12
> > > 17130352 bytes took 158ms
> > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_13
> > > 17128544 bytes took 158ms
> > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_14
> > > 17128664 bytes took 158ms
> > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_15
> > > 17128328 bytes took 158ms
> > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_16
> > > 17129288 bytes took 158ms
> > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_17
> > > 17131056 bytes took 158ms
> > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_18
> > > 17130344 bytes took 158ms
> > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_19
> > > 17128432 bytes took 160ms
> > >
> > > new:
> > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_0
> 120224
> > > bytes took 1ms
> > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_1
> 123144
> > > bytes took 1ms
> > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_10
> > > 17127928 bytes took 157ms
> > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_11
> > > 17127144 bytes took 152ms
> > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_12
> > > 17130352 bytes took 154ms
> > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_13
> > > 17128544 bytes took 153ms
> > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_14
> > > 17128664 bytes took 154ms
> > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_15
> > > 17128328 bytes took 153ms
> > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_16
> > > 17129288 bytes took 152ms
> > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_17
> > > 17131056 bytes took 153ms
> > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_18
> > > 17130344 bytes took 155ms
> > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_19
> > > 17128432 bytes took 153ms
> > >
> > > I'm going to chunk the dirs to speed up the reads and throw it into a par
> > > iter.
> > >
> > > On Fri, 28 May 2021 at 09:09, Josh Taylor <jo...@gmail.com> wrote:
> > >
> > > > Hi!
> > > >
> > > > I've been using arrow/arrow-rs for a while now; my use case is to parse
> > > > Arrow streaming files and convert them into CSV.
> > > >
> > > > Rust has been an absolutely fantastic tool for this; the performance is
> > > > outstanding and I have had no issues using it for my use case.
> > > >
> > > > I would be happy to test out the branch and let you know what the
> > > > performance is like, as I was going to improve the current implementation
> > > > that I have for the CSV writer, since it takes a while for bigger datasets
> > > > (multi-GB).
> > > >
> > > > Josh
> > > >
> > > >
> > > > On Thu, 27 May 2021 at 22:49, Jed Brown <je...@jedbrown.org> wrote:
> > > >
> > > >> Andy Grove <an...@gmail.com> writes:
> > > >> >
> > > >> > Looking at this purely from the DataFusion/Ballista point of view, what
> > > >> > I would be interested in would be having a branch of DF that uses arrow2
> > > >> > and once that branch has all tests passing and can run queries with
> > > >> > performance that is at least as good as the original arrow crate, then
> > > >> > cut over.
> > > >> >
> > > >> > However, for developers using the arrow APIs directly, I don't see an
> > > >> > easy path. We either try to gradually PR the changes in (which seems
> > > >> > really hard given that there are significant changes to APIs and
> > > >> > internal data structures) or we port some portion of the existing tests
> > > >> > over to arrow2 and then make that the official crate once all tests pass.
> > > >>
> > > >> How feasible would it be to make a legacy module in arrow2 that would
> > > >> enable (some large subset of) existing arrow users to try arrow2 after
> > > >> adjusting their use statements? (That is, implement the public-facing
> > > >> legacy interfaces in terms of arrow2's new, safe interface.) This would
> > > >> make it easier to test with DataFusion/Ballista and external users of
> > > >> the current arrow crate, then cut over and let those packages update
> > > >> incrementally from legacy to modern arrow2.
> > > >>
> > > >> I think it would be okay to tolerate some performance degradation when
> > > >> working through these legacy interfaces, so long as there was confidence
> > > >> that modernizing the callers would recover the performance (as tests
> > > >> have been showing).
> > > >>
> > > >
> > >
>

Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

Posted by Wes McKinney <we...@gmail.com>.
On Sun, Jun 6, 2021 at 1:47 AM Jorge Cardoso Leitão
<jo...@gmail.com> wrote:
>
> Hi,
>
> Thanks a lot for your feedback. I agree with all the arguments put forward,
> including Andrew's point about the large change.
>
> I tried a gradual approach 4 months ago, but it was really difficult and I gave up.
> I estimate that the work involved is half the work of writing parquet2 and
> arrow2 in the first place. The internal dependency on ArrayData (the main
> culprit of the unsafe code) in arrow-rs is so prevalent that all core components
> need to be re-written from scratch (IPC, FFI, IO, array/transform/*,
> compute, SIMD). I personally do not have the motivation to do it, though.
>
> Jed, the public API changes are small for end users. A typical migration is
> [1]. I agree that we can further reduce the change-set by keeping legacy
> interfaces available.
>
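
For a sense of scale, a minimal sketch of the kind of change-set such a migration involves at the use-statement level (the names below are illustrative; the linked diff in [1] is authoritative):

    // Before: arrow-rs exposes one concrete struct per primitive type.
    use arrow::array::Int64Array;

    // After: arrow2 exposes primitive arrays generically over the native
    // type, so a type alias can recover the old name.
    use arrow2::array::PrimitiveArray;
    type Int64Array = PrimitiveArray<i64>;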
> Andy, on my machine, the current benchmarks on query 1 yield:
>
> type, master (ms), PR [2] for arrow2+parquet2 (ms)
> memory (-m): 332.9, 239.6
> load (the initial time in -m with --format parquet): 5286.0, 3043.0
> parquet format: 1316.1, 930.7
> tbl format: 5297.3, 5383.1
>
> i.e. I am observing some improvements. Queries with joins are still slower.
> The pruning of parquet groups and pages based on stats is not yet there; I
> am working on it.
>
> I agree that this should go through IP clearance. I will start this
> process. My thinking would be to create two empty repos on apache/*, and
> create 2 PRs from the main branches of each of my repos to those repos, and
> only merge them once IP is cleared. Would that be a reasonable process, Wes?

This sounds plenty fine to me — I'm happy to assist with the IP
clearance process, having done it several times in the past. I don't
have an opinion about the names, but having experimental- in the name
sounds in line with the previous discussion we had about this.

> Names: arrow-experimental-rs2 and arrow-experimental-rs-parquet2, or?
>
> Best,
> Jorge
>
> [1]
> https://github.com/apache/arrow-datafusion/pull/68/files#diff-2ec0d66fd16c73ff72a23d40186944591e040507c731228ad70b4e168e2a4660
> [2] https://github.com/apache/arrow-datafusion/pull/68
>
>
> On Fri, May 28, 2021 at 5:22 AM Josh Taylor <jo...@gmail.com> wrote:
>
> > I played around with it. For my use case I really like the new way of
> > writing CSVs; it's much more obvious. I love the `read_stream_metadata`
> > function as well.
> >
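
For readers who want to try the same flow, a minimal sketch against arrow2's IPC stream API as it stood at the time of this thread (later releases changed these signatures, e.g. `StreamReader::new` gained a projection argument, so treat the exact calls as an approximation):

    use std::fs::File;
    use std::io::BufReader;

    use arrow2::error::Result;
    use arrow2::io::ipc::read::{read_stream_metadata, StreamReader};

    fn main() -> Result<()> {
        let mut file = BufReader::new(File::open("data_0_0_0")?);
        // The stream's schema and dictionaries are read once, up front.
        let metadata = read_stream_metadata(&mut file)?;
        // The reader then yields record batches one at a time.
        let reader = StreamReader::new(file, metadata);
        for batch in reader {
            let batch = batch?;
            // ... serialize `batch` to CSV rows here ...
            println!("read a batch of {} rows", batch.num_rows());
        }
        Ok(())
    }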
> > I'm seeing a very slight speed (~8ms) improvement on my end, but I read a
> > bunch of files in a directory and spit out a CSV; the bottleneck is the
> > parsing of lots of files, but it's pretty quick per file.
> >
> > old:
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_0 120224
> > bytes took 1ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_1 123144
> > bytes took 1ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_10
> > 17127928 bytes took 159ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_11
> > 17127144 bytes took 160ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_12
> > 17130352 bytes took 158ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_13
> > 17128544 bytes took 158ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_14
> > 17128664 bytes took 158ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_15
> > 17128328 bytes took 158ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_16
> > 17129288 bytes took 158ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_17
> > 17131056 bytes took 158ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_18
> > 17130344 bytes took 158ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_19
> > 17128432 bytes took 160ms
> >
> > new:
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_0 120224
> > bytes took 1ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_1 123144
> > bytes took 1ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_10
> > 17127928 bytes took 157ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_11
> > 17127144 bytes took 152ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_12
> > 17130352 bytes took 154ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_13
> > 17128544 bytes took 153ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_14
> > 17128664 bytes took 154ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_15
> > 17128328 bytes took 153ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_16
> > 17129288 bytes took 152ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_17
> > 17131056 bytes took 153ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_18
> > 17130344 bytes took 155ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_19
> > 17128432 bytes took 153ms
> >
> > I'm going to chunk the dirs to speed up the reads and throw it into a par
> > iter.
> >
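
A hedged sketch of that kind of fan-out with rayon; the directory path is taken from the logs above, and `convert` is a placeholder for the actual parse-and-write step:

    use std::path::PathBuf;

    use rayon::prelude::*;

    /// Placeholder for the real work: parse one Arrow file, write CSV.
    fn convert(path: &PathBuf) {
        println!("converting {}", path.display());
    }

    fn main() -> std::io::Result<()> {
        // Collect the directory entries first, then fan the per-file work
        // out across rayon's thread pool with a parallel iterator.
        let paths: Vec<PathBuf> = std::fs::read_dir("/home/josh/staging")?
            .filter_map(|entry| entry.ok().map(|e| e.path()))
            .collect();
        paths.par_iter().for_each(convert);
        Ok(())
    }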
> > On Fri, 28 May 2021 at 09:09, Josh Taylor <jo...@gmail.com> wrote:
> >
> > > Hi!
> > >
> > > I've been using arrow/arrow-rs for a while now; my use case is to parse
> > > Arrow streaming files and convert them into CSV.
> > >
> > > Rust has been an absolutely fantastic tool for this; the performance is
> > > outstanding and I have had no issues using it for my use case.
> > >
> > > I would be happy to test out the branch and let you know what the
> > > performance is like, as I was going to improve the current implementation
> > > that I have for the CSV writer, since it takes a while for bigger datasets
> > > (multi-GB).
> > >
> > > Josh
> > >
> > >
> > > On Thu, 27 May 2021 at 22:49, Jed Brown <je...@jedbrown.org> wrote:
> > >
> > >> Andy Grove <an...@gmail.com> writes:
> > >> >
> > >> > Looking at this purely from the DataFusion/Ballista point of view, what
> > >> > I would be interested in would be having a branch of DF that uses arrow2
> > >> > and once that branch has all tests passing and can run queries with
> > >> > performance that is at least as good as the original arrow crate, then
> > >> > cut over.
> > >> >
> > >> > However, for developers using the arrow APIs directly, I don't see an
> > >> > easy path. We either try to gradually PR the changes in (which seems
> > >> > really hard given that there are significant changes to APIs and
> > >> > internal data structures) or we port some portion of the existing tests
> > >> > over to arrow2 and then make that the official crate once all tests pass.
> > >>
> > >> How feasible would it be to make a legacy module in arrow2 that would
> > >> enable (some large subset of) existing arrow users to try arrow2 after
> > >> adjusting their use statements? (That is, implement the public-facing
> > >> legacy interfaces in terms of arrow2's new, safe interface.) This would
> > >> make it easier to test with DataFusion/Ballista and external users of
> > >> the current arrow crate, then cut over and let those packages update
> > >> incrementally from legacy to modern arrow2.
> > >>
> > >> I think it would be okay to tolerate some performance degradation when
> > >> working through these legacy interfaces, so long as there was confidence
> > >> that modernizing the callers would recover the performance (as tests
> > >> have been showing).
> > >>
> > >
> >
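
To make the legacy-module proposal above concrete, a purely hypothetical sketch of such a facade: old-style names forwarding to a new, safe core. None of these items exist in either crate; they only illustrate the shape of the idea.

    /// Stand-in for the new, safe array implementation.
    mod new_core {
        pub struct NewInt64Array {
            values: Vec<i64>,
        }

        impl NewInt64Array {
            pub fn from_vec(values: Vec<i64>) -> Self {
                Self { values }
            }

            pub fn len(&self) -> usize {
                self.values.len()
            }
        }
    }

    /// Legacy module: the old public-facing API, implemented on top of the
    /// safe core, so existing callers only adjust their `use` statements.
    pub mod legacy {
        use super::new_core::NewInt64Array;

        pub struct Int64Array(NewInt64Array);

        impl Int64Array {
            pub fn from(values: Vec<i64>) -> Self {
                Int64Array(NewInt64Array::from_vec(values))
            }

            pub fn len(&self) -> usize {
                self.0.len()
            }
        }
    }

    fn main() {
        let a = legacy::Int64Array::from(vec![1, 2, 3]);
        assert_eq!(a.len(), 3);
    }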