Posted to dev@arrow.apache.org by Jorge Cardoso Leitão <jo...@gmail.com> on 2021/02/07 13:42:16 UTC

[Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

Hi,

Over the past 4 months, I have grown increasingly frustrated by the
amount of undefined behaviour that I keep finding and fixing in the Rust
implementation. I would like to open a broader discussion of the problem
in light of our current knowledge and of what Rust enables, and to offer
a solution to the underlying issue.

To give you a sense of the seriousness of the issue, the following
currently compiles, runs, and is undefined behavior in Rust:

let buffer = Buffer::from(&[0i32, 2i32]);
let data = ArrayData::new(DataType::Int64, 10, 0, None, 0, vec![buffer], vec![]);
let array = Float64Array::from(Arc::new(data));
println!("{:?}", array.value(1));
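
To spell out the mismatch, here is a back-of-the-envelope check of the
sizes involved (the constants mirror the example above):

fn main() {
    // The buffer holds two i32 values: 8 bytes in total.
    let bytes_available = 2 * std::mem::size_of::<i32>();
    // The ArrayData claims a length of 10 elements of a 64-bit type: 80 bytes.
    let bytes_claimed = 10 * std::mem::size_of::<i64>();
    assert!(bytes_claimed > bytes_available);
    // array.value(1) therefore reads memory the buffer does not own, and
    // reinterprets whatever bytes it finds as an f64.
    println!("{} bytes backing a layout that needs {}", bytes_available, bytes_claimed);
}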

I would like to propose a major refactor of the crate around the physical
traits, Buffer, MutableBuffer and ArrayData, to make our code type-safe at
compile time, thereby preventing cases like the example above from happening
again.

So far, I have been able to reproduce all core features of the arrow crate
(nested types, dynamic typing, FFI, memory alignment, performance) by using
`Buffer<T: NativeType>` instead of `Buffer` and removing `ArrayData` and
RawPointer altogether.
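
As a minimal sketch of the idea (illustrative names and signatures only,
not the proposal's actual API), a typed buffer turns the kind of mismatch
shown above into a compile-time error:

trait NativeType: Copy + std::fmt::Debug + 'static {}
impl NativeType for i32 {}
impl NativeType for i64 {}
impl NativeType for f64 {}

#[derive(Debug)]
struct Buffer<T: NativeType> {
    data: Vec<T>,
}

impl<T: NativeType> Buffer<T> {
    fn from_slice(slice: &[T]) -> Self {
        Buffer { data: slice.to_vec() }
    }
    fn value(&self, i: usize) -> T {
        self.data[i] // bounds-checked; cannot read past the allocation
    }
}

#[allow(dead_code)]
struct Float64Array {
    values: Buffer<f64>,
}

fn main() {
    let buffer = Buffer::from_slice(&[0i32, 2i32]);
    println!("{:?}", buffer.value(1));
    // let array = Float64Array { values: buffer };
    // ^ rejected at compile time: expected `Buffer<f64>`, found `Buffer<i32>`
}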

Safety-wise, it significantly limits the use of `unsafe` in higher-level
APIs, it has a single transmute (the one in the bit-chunk iterator), and it
offers a guaranteed-safe public API (which is not the case on our master, as
shown above).
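
For context, the bit-chunk iterator processes a validity bitmap 64 bits at
a time. A safe illustration of the idea (the actual iterator is implemented
differently and uses the one transmute mentioned above):

fn bit_chunks(bitmap: &[u8]) -> impl Iterator<Item = u64> + '_ {
    // View the byte bitmap as little-endian u64 words, zero-padding the
    // last chunk, so 64 validity bits can be processed per iteration.
    bitmap.chunks(8).map(|chunk| {
        let mut word = [0u8; 8];
        word[..chunk.len()].copy_from_slice(chunk);
        u64::from_le_bytes(word)
    })
}

fn main() {
    let bitmap = [0b1010_1010u8, 0b0000_1111, 0b1111_0000];
    for (i, word) in bit_chunks(&bitmap).enumerate() {
        println!("word {}: {} set bits", i, word.count_ones());
    }
}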

Performance-wise, it yields a 1.3x improvement over the current master
(after this fix <https://github.com/apache/arrow/pull/9301> of UB in the
take kernel; 1.7x prior to it) for the `take` kernel on primitives. It
should yield other major performance improvements as well.
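
For reference, this is the shape of a bounds-checked `take` over
primitives (a minimal sketch, not the actual kernel):

fn take<T: Copy>(values: &[T], indices: &[usize]) -> Result<Vec<T>, String> {
    // Gather values[i] for each index, surfacing invalid indices as
    // errors instead of undefined behavior.
    indices
        .iter()
        .map(|&i| {
            values
                .get(i)
                .copied()
                .ok_or_else(|| format!("index {} out of bounds (len {})", i, values.len()))
        })
        .collect()
}

fn main() {
    let values = [10.0f64, 20.0, 30.0];
    println!("{:?}", take(&values, &[2, 0, 1])); // Ok([30.0, 10.0, 20.0])
    println!("{:?}", take(&values, &[5]));       // Err("index 5 out of bounds (len 3)")
}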

API-wise, it simplifies the traits that we have for memory layout, as well
as the handling of bitmaps, offsets, etc.

The proposal is drafted as a README
<https://github.com/jorgecarleitao/arrow2/blob/proposal/README.md> on a
repo that I created from the ground up specifically for this, and the full
set of changes is in a PR <https://github.com/jorgecarleitao/arrow2/pull/1>
so that anyone can view and comment on it. I haven't made any PR to master
because this is too large to track as a diff against master, and is beside
the point anyway.

I haven't ported most of the crate as I only tried the non-trivial features
(memory layout, bitmaps, FFI, dynamic typing, nested types).

I would highly appreciate your thoughts about it.

Best,
Jorge

Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.
Awesome. Thanks Wes.

I have now initiated the vote for both projects.

Best,
Jorge


On Sat, Jul 10, 2021 at 1:26 PM Wes McKinney <we...@gmail.com> wrote:

> The process for updating the website is described on
>
> https://incubator.apache.org/guides/website.html
>
> It looks like you need to add the new entries to the index.xml file
> and then trigger a website build (which should be triggered by changes
> to SVN, but if not you can trigger one manually through Jenkins).
>
> After the new IP clearance pages are visible you should send an IP
> clearance lazy consensus vote to general@incubator.apache.org like
>
>
> https://lists.apache.org/thread.html/r319b85f0f24f9b0529865387ccfe1b2a00a16f394a48144ba25c3225%40%3Cgeneral.incubator.apache.org%3E
>
> On Sat, Jul 10, 2021 at 7:48 AM Jorge Cardoso Leitão
>
> <jo...@gmail.com> wrote:
> >
> > Thanks a lot Wes,
> >
> > I am not sure how to proceed from here:
> >
> > 1. how do we generate the html from the xml? I.e.
> > https://incubator.apache.org/ip-clearance/arrow-rust-ballista.html
> > 2. how do I trigger the process to start? Can I just email the
> > incubator with the proposal?
> >
> > Best,
> > Jorge
> >
> >
> >
> > On Mon, Jul 5, 2021 at 10:38 AM Wes McKinney <we...@gmail.com>
> wrote:
> >
> > > Great, thanks for the update and pushing this forward. Let us know if
> > > you need help with anything.
> > >
> > > On Sun, Jul 4, 2021 at 8:26 PM Jorge Cardoso Leitão
> > > <jo...@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Wes and Neils,
> > > >
> > > > Thank you for your feedback and offer. I have created the two .xml
> > > reports:
> > > >
> > > >
> > >
> http://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/arrow-rust-experimental-arrow.xml
> > > >
> > >
> http://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/arrow-rust-experimental-parquet.xml
> > > >
> > > > I based them on the report for Ballista. I also requested, on the PRs
> > > > [1,2], clarification with respect to each contributor's contributions to each.
> > > >
> > > > Best,
> > > > Jorge
> > > >
> > > > [1] https://github.com/apache/arrow-experimental-rs-arrow2/pull/1
> > > > [2] https://github.com/apache/arrow-experimental-rs-parquet2/pull/1
> > > >
> > > >
> > > >
> > > > On Mon, Jun 7, 2021 at 11:55 PM Wes McKinney <we...@gmail.com>
> > > wrote:
> > > >
> > > > > On Sun, Jun 6, 2021 at 1:47 AM Jorge Cardoso Leitão
> > > > > <jo...@gmail.com> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Thanks a lot for your feedback. I agree with all the arguments
> put
> > > > > forward,
> > > > > > including Andrew's point about the large change.
> > > > > >
> > > > > > I tried a gradual migration 4 months ago, but it was really difficult and I
> > > gave
> > > > > up.
> > > > > > I estimate that the work involved is half the work of writing
> > > parquet2
> > > > > and
> > > > > > arrow2 in the first place. The internal dependency on ArrayData
> (the
> > > main
> > > > > > culprit of the unsafe) on arrow-rs is so prevalent that all core
> > > > > components
> > > > > > need to be re-written from scratch (IPC, FFI, IO,
> array/transform/*,
> > > > > > compute, SIMD). I personally do not have the motivation to do it,
> > > though.
> > > > > >
> > > > > > Jed, the public API changes are small for end users. A typical
> > > migration
> > > > > is
> > > > > > [1]. I agree that we can further reduce the change-set by keeping
> > > legacy
> > > > > > interfaces available.
> > > > > >
> > > > > > Andy, on my machine, the current benchmarks on query 1 yield:
> > > > > >
> > > > > > type, master (ms), PR [2] for arrow2+parquet2 (ms)
> > > > > > memory (-m): 332.9, 239.6
> > > > > > load (the initial time in -m with --format parquet): 5286.0,
> 3043.0
> > > > > > parquet format: 1316.1, 930.7
> > > > > > tbl format: 5297.3, 5383.1
> > > > > >
> > > > > > i.e. I am observing some improvements. Queries with joins are
> still
> > > > > slower.
> > > > > > The pruning of parquet groups and pages based on stats are not
> yet
> > > > > there; I
> > > > > > am working on them.
> > > > > >
> > > > > > I agree that this should go through IP clearance. I will start
> this
> > > > > > process. My thinking would be to create two empty repos on
> apache/*,
> > > and
> > > > > > create 2 PRs from the main branches of each of my repos to those
> > > repos,
> > > > > and
> > > > > > only merge them once IP is cleared. Would that be a reasonable
> > > process,
> > > > > Wes?
> > > > >
> > > > > This sounds plenty fine to me — I'm happy to assist with the IP
> > > > > clearance process having done it several times in the past. I don't
> > > > > have an opinion about the names, but having experimental- in the
> name
> > > > > sounds in line with the previous discussion we had about this.
> > > > >
> > > > > > Names: arrow-experimental-rs2 and
> arrow-experimental-rs-parquet2, or?
> > > > > >
> > > > > > Best,
> > > > > > Jorge
> > > > > >
> > > > > > [1]
> > > > > >
> > > > >
> > >
> https://github.com/apache/arrow-datafusion/pull/68/files#diff-2ec0d66fd16c73ff72a23d40186944591e040507c731228ad70b4e168e2a4660
> > > > > > [2] https://github.com/apache/arrow-datafusion/pull/68
> > > > > >
> > > > > >
> > > > > > On Fri, May 28, 2021 at 5:22 AM Josh Taylor <
> joshuataylorx@gmail.com
> > > >
> > > > > wrote:
> > > > > >
> > > > > > > I played around with it, for my use case I really like the new
> way
> > > of
> > > > > > > writing CSVs, it's much more obvious. I love the
> > > `read_stream_metadata`
> > > > > > > function as well.
> > > > > > >
> > > > > > > I'm seeing a very slight speed (~8ms) improvement on my end,
> but I
> > > > > read a
> > > > > > > bunch of files in a directory and spit out a CSV, the
> bottleneck
> > > is the
> > > > > > > parsing of lots of files, but it's pretty quick per file.
> > > > > > >
> > > > > > > old:
> > > > > > >
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_0
> > > > > 120224
> > > > > > > bytes took 1ms
> > > > > > >
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_1
> > > > > 123144
> > > > > > > bytes took 1ms
> > > > > > >
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_10
> > > > > > > 17127928 bytes took 159ms
> > > > > > >
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_11
> > > > > > > 17127144 bytes took 160ms
> > > > > > >
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_12
> > > > > > > 17130352 bytes took 158ms
> > > > > > >
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_13
> > > > > > > 17128544 bytes took 158ms
> > > > > > >
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_14
> > > > > > > 17128664 bytes took 158ms
> > > > > > >
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_15
> > > > > > > 17128328 bytes took 158ms
> > > > > > >
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_16
> > > > > > > 17129288 bytes took 158ms
> > > > > > >
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_17
> > > > > > > 17131056 bytes took 158ms
> > > > > > >
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_18
> > > > > > > 17130344 bytes took 158ms
> > > > > > >
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_19
> > > > > > > 17128432 bytes took 160ms
> > > > > > >
> > > > > > > new:
> > > > > > >
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_0
> > > > > 120224
> > > > > > > bytes took 1ms
> > > > > > >
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_1
> > > > > 123144
> > > > > > > bytes took 1ms
> > > > > > >
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_10
> > > > > > > 17127928 bytes took 157ms
> > > > > > >
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_11
> > > > > > > 17127144 bytes took 152ms
> > > > > > >
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_12
> > > > > > > 17130352 bytes took 154ms
> > > > > > >
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_13
> > > > > > > 17128544 bytes took 153ms
> > > > > > >
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_14
> > > > > > > 17128664 bytes took 154ms
> > > > > > >
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_15
> > > > > > > 17128328 bytes took 153ms
> > > > > > >
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_16
> > > > > > > 17129288 bytes took 152ms
> > > > > > >
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_17
> > > > > > > 17131056 bytes took 153ms
> > > > > > >
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_18
> > > > > > > 17130344 bytes took 155ms
> > > > > > >
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_19
> > > > > > > 17128432 bytes took 153ms
> > > > > > >
> > > > > > > I'm going to chunk the dirs to speed up the reads and throw it
> > > into a
> > > > > par
> > > > > > > iter.
> > > > > > >
> > > > > > > On Fri, 28 May 2021 at 09:09, Josh Taylor <
> joshuataylorx@gmail.com
> > > >
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi!
> > > > > > > >
> > > > > > > > I've been using arrow/arrow-rs for a while now, my use case
> is to
> > > > > parse
> > > > > > > > Arrow streaming files and convert them into CSV.
> > > > > > > >
> > > > > > > > Rust has been an absolute fantastic tool for this, the
> > > performance is
> > > > > > > > outstanding and I have had no issues using it for my use
> case.
> > > > > > > >
> > > > > > > > I would be happy to test out the branch and let you know
> what the
> > > > > > > > performance is like, as I was going to improve the current
> > > > > implementation
> > > > > > > > that I have for the CSV writer, as it takes a while for
> bigger
> > > > > datasets
> > > > > > > > (multi-GB).
> > > > > > > >
> > > > > > > > Josh
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, 27 May 2021 at 22:49, Jed Brown <je...@jedbrown.org>
> > > wrote:
> > > > > > > >
> > > > > > > >> Andy Grove <an...@gmail.com> writes:
> > > > > > > >> >
> > > > > > > >> > Looking at this purely from the DataFusion/Ballista point
> of
> > > view,
> > > > > > > what
> > > > > > > >> I
> > > > > > > >> > would be interested in would be having a branch of DF that
> > > uses
> > > > > arrow2
> > > > > > > >> and
> > > > > > > >> > once that branch has all tests passing and can run queries
> > > with
> > > > > > > >> performance
> > > > > > > >> > that is at least as good as the original arrow crate,
> then cut
> > > > > over.
> > > > > > > >> >
> > > > > > > >> > However, for developers using the arrow APIs directly, I
> don't
> > > > > see an
> > > > > > > >> easy
> > > > > > > >> > path. We either try and gradually PR the changes in (which
> > > seems
> > > > > > > really
> > > > > > > >> > hard given that there are significant changes to APIs and
> > > internal
> > > > > > > data
> > > > > > > >> > structures) or we port some portion of the existing tests
> > > over to
> > > > > > > arrow2
> > > > > > > >> > and then make that the official crate once all test pass.
> > > > > > > >>
> > > > > > > >> How feasible would it be to make a legacy module in arrow2
> that
> > > > > would
> > > > > > > >> enable (some large subset of) existing arrow users to try
> arrow2
> > > > > after
> > > > > > > >> adjusting their use statements? (That is, implement the
> > > > > public-facing
> > > > > > > >> legacy interfaces in terms of arrow2's new, safe interface.)
> > > This
> > > > > would
> > > > > > > >> make it easier to test with DataFusion/Ballista and external
> > > users
> > > > > of
> > > > > > > the
> > > > > > > >> current arrow crate, then cut over and let those packages
> update
> > > > > > > >> incrementally from legacy to modern arrow2.
> > > > > > > >>
> > > > > > > >> I think it would be okay to tolerate some performance
> > > degradation
> > > > > when
> > > > > >> working through these legacy interfaces, so long as there was
> > > > > confidence
> > > > > > > >> that modernizing the callers would recover the performance
> (as
> > > tests
> > > > > > > have
> > > > > > > >> been showing).
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > >
> > >
>
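
As a rough illustration of the legacy-module idea from Jed's message above
(all names are hypothetical, not the real arrow or arrow2 APIs): old-style
interfaces can be implemented in terms of a new, safe core, so callers can
migrate by changing `use` statements first and modernizing call sites later.

mod modern {
    pub struct Int32Array {
        values: Vec<i32>,
    }
    impl Int32Array {
        pub fn new(values: Vec<i32>) -> Self {
            Self { values }
        }
        /// Safe accessor: out-of-bounds reads are impossible.
        pub fn get(&self, i: usize) -> Option<i32> {
            self.values.get(i).copied()
        }
    }
}

mod legacy {
    /// Old-style wrapper that keeps the familiar panicking signature,
    /// implemented entirely in terms of the safe `modern` API.
    pub struct Int32Array(pub crate::modern::Int32Array);
    impl Int32Array {
        pub fn value(&self, i: usize) -> i32 {
            self.0.get(i).expect("index out of bounds")
        }
    }
}

fn main() {
    let array = legacy::Int32Array(modern::Int32Array::new(vec![1, 2, 3]));
    println!("{}", array.value(1));
}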

Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

Posted by Wes McKinney <we...@gmail.com>.
The process for updating the website is described on

https://incubator.apache.org/guides/website.html

It looks like you need to add the new entries to the index.xml file
and then trigger a website build (which should be triggered by changes
to SVN, but if not you can trigger one manually through Jenkins).

After the new IP clearance pages are visible you should send an IP
clearance lazy consensus vote to general@incubator.apache.org like

https://lists.apache.org/thread.html/r319b85f0f24f9b0529865387ccfe1b2a00a16f394a48144ba25c3225%40%3Cgeneral.incubator.apache.org%3E

Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.
Thanks a lot Wes,

I am not sure how to proceed from here:

1. how do we generate the html from the xml? I.e.
https://incubator.apache.org/ip-clearance/arrow-rust-ballista.html
2. how do I trigger the process to start? Can I just email the
incubator with the proposal?

Best,
Jorge

Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

Posted by Wes McKinney <we...@gmail.com>.
Great, thanks for the update and pushing this forward. Let us know if
you need help with anything.


Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.
Hi,

Wes and Neils,

Thank you for your feedback and offer. I have created the two .xml reports:

http://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/arrow-rust-experimental-arrow.xml
http://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/arrow-rust-experimental-parquet.xml

I based them on the report for Ballista. I also requested, on the PRs
[1,2], clarification with respect to each contributor's contributions to each.

Best,
Jorge

[1] https://github.com/apache/arrow-experimental-rs-arrow2/pull/1
[2] https://github.com/apache/arrow-experimental-rs-parquet2/pull/1

Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

Posted by Wes McKinney <we...@gmail.com>.
On Sun, Jun 6, 2021 at 1:47 AM Jorge Cardoso Leitão
<jo...@gmail.com> wrote:
> I agree that this should go through IP clearance. I will start this
> process. My thinking would be to create two empty repos on apache/*, and
> create 2 PRs from the main branches of each of my repos to those repos, and
> only merge them once IP is cleared. Would that be a reasonable process, Wes?

This sounds plenty fine to me — I'm happy to assist with the IP
clearance process having done it several times in the past. I don't
have an opinion about the names, but having experimental- in the name
sounds in line with the previous discussion we had about this.

> Names: arrow-experimental-rs2 and arrow-experimental-rs-parquet2, or?
>
> Best,
> Jorge
>
> [1]
> https://github.com/apache/arrow-datafusion/pull/68/files#diff-2ec0d66fd16c73ff72a23d40186944591e040507c731228ad70b4e168e2a4660
> [2] https://github.com/apache/arrow-datafusion/pull/68

Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.
Hi,

Thanks a lot for your feedback. I agree with all the arguments put forward,
including Andrew's point about the large change.

I tried a gradual migration 4 months ago, but it was really difficult and I gave up.
I estimate that the work involved is half the work of writing parquet2 and
arrow2 in the first place. The internal dependency on ArrayData (the main
culprit of the unsafe) on arrow-rs is so prevalent that all core components
need to be re-written from scratch (IPC, FFI, IO, array/transform/*,
compute, SIMD). I personally do not have the motivation to do it, though.
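
To make the ArrayData point concrete, here is a minimal sketch of the
difference between the two designs (illustrative types only, not the
actual definitions in either crate):

// Untyped design: the buffer is just bytes, so the element type is
// only decided when the buffer is read. Nothing prevents bytes written
// as i32 from being read back as f64 -- the UB shown earlier in this
// thread.
struct Buffer {
    data: Vec<u8>,
}

// Typed design: the element type is part of the buffer's type, so a
// Float64Array could only be built from a buffer of f64, and a
// mismatch becomes a compile-time error.
struct TypedBuffer<T: Copy> {
    data: Vec<T>,
}

fn main() {
    let untyped = Buffer { data: vec![0, 0, 0, 0, 2, 0, 0, 0] }; // two i32s? one f64?
    let typed: TypedBuffer<i32> = TypedBuffer { data: vec![0, 2] }; // unambiguous
    let _ = (untyped, typed);
}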

Jed, the public API changes are small for end users. A typical migration is
[1]. I agree that we can further reduce the change-set by keeping legacy
interfaces available.

Andy, on my machine, the current benchmarks on query 1 yield:

type, master (ms), PR [2] for arrow2+parquet2 (ms)
memory (-m): 332.9, 239.6
load (the initial time in -m with --format parquet): 5286.0, 3043.0
parquet format: 1316.1, 930.7
tbl format: 5297.3, 5383.1

i.e. I am observing some improvements. Queries with joins are still slower.
The pruning of parquet row groups and pages based on stats is not yet
there; I am working on it.
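
For readers unfamiliar with stats-based pruning, the idea is roughly
the following (a sketch with made-up types; parquet2's eventual API may
look different):

// Hypothetical per-row-group statistics, standing in for whatever
// parquet2 ends up exposing.
struct RowGroupStats {
    min: i64,
    max: i64,
}

// Keep only the row groups whose [min, max] range can contain `value`;
// the remaining groups are skipped without decoding a single page.
fn prune(groups: &[RowGroupStats], value: i64) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, s)| s.min <= value && value <= s.max)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let groups = [
        RowGroupStats { min: 0, max: 9 },
        RowGroupStats { min: 10, max: 19 },
    ];
    // A predicate `x = 12` only needs the second row group.
    assert_eq!(prune(&groups, 12), vec![1]);
}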

I agree that this should go through IP clearance. I will start this
process. My thinking would be to create two empty repos on apache/*, and
create 2 PRs from the main branches of each of my repos to those repos, and
only merge them once IP is cleared. Would that be a reasonable process, Wes?

Names: arrow-experimental-rs2 and arrow-experimental-rs-parquet2, or?

Best,
Jorge

[1]
https://github.com/apache/arrow-datafusion/pull/68/files#diff-2ec0d66fd16c73ff72a23d40186944591e040507c731228ad70b4e168e2a4660
[2] https://github.com/apache/arrow-datafusion/pull/68



Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

Posted by Josh Taylor <jo...@gmail.com>.
I played around with it, for my use case I really like the new way of
writing CSVs, it's much more obvious. I love the `read_stream_metadata`
function as well.
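
For context, the shape of that pipeline looks roughly like this;
`read_stream_metadata` is the function mentioned above, while the
`StreamReader` and `io::csv::write` calls are my best guess at the
surrounding API and may not match the crate's exact signatures:

use std::fs::File;
use std::io::BufReader;

use arrow2::error::Result;
use arrow2::io::csv::write as csv_write; // assumed module path
use arrow2::io::ipc::read::{read_stream_metadata, StreamReader};

fn stream_to_csv(input: &str, output: &str) -> Result<()> {
    // Read the stream's schema once, then iterate over record batches.
    let mut reader = BufReader::new(File::open(input)?);
    let metadata = read_stream_metadata(&mut reader)?;
    let schema = metadata.schema.clone();
    let stream = StreamReader::new(reader, metadata);

    // Serialize each batch as CSV rows (signatures assumed).
    let mut writer = csv_write::WriterBuilder::new().from_path(output)?;
    csv_write::write_header(&mut writer, &schema)?;
    let options = csv_write::SerializeOptions::default();
    for batch in stream {
        csv_write::write_batch(&mut writer, &batch?, &options)?;
    }
    Ok(())
}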

I'm seeing a very slight speed improvement (~8ms) on my end, but I read a
bunch of files in a directory and spit out a CSV; the bottleneck is
parsing lots of files, but it's pretty quick per file.

old:
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_0 120224
bytes took 1ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_1 123144
bytes took 1ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_10
17127928 bytes took 159ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_11
17127144 bytes took 160ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_12
17130352 bytes took 158ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_13
17128544 bytes took 158ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_14
17128664 bytes took 158ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_15
17128328 bytes took 158ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_16
17129288 bytes took 158ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_17
17131056 bytes took 158ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_18
17130344 bytes took 158ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_19
17128432 bytes took 160ms

new:
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_0 120224
bytes took 1ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_1 123144
bytes took 1ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_10
17127928 bytes took 157ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_11
17127144 bytes took 152ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_12
17130352 bytes took 154ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_13
17128544 bytes took 153ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_14
17128664 bytes took 154ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_15
17128328 bytes took 153ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_16
17129288 bytes took 152ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_17
17131056 bytes took 153ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_18
17130344 bytes took 155ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_19
17128432 bytes took 153ms

I'm going to chunk the dirs to speed up the reads and throw it into a par
iter.
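
For the curious, the par-iter version would look roughly like this (a
sketch, with a hypothetical `convert_file` standing in for the
Arrow-to-CSV step whose timings are shown above):

use rayon::prelude::*;
use std::path::{Path, PathBuf};

// Hypothetical per-file conversion, standing in for the Arrow-to-CSV
// step timed above.
fn convert_file(path: &Path) -> std::io::Result<u64> {
    Ok(std::fs::metadata(path)?.len())
}

fn main() -> std::io::Result<()> {
    // Collect the directory entries first...
    let paths: Vec<PathBuf> = std::fs::read_dir("staging")?
        .filter_map(|entry| entry.ok().map(|e| e.path()))
        .collect();

    // ...then fan the per-file work out across rayon's thread pool.
    let total: u64 = paths
        .par_iter()
        .filter_map(|path| convert_file(path).ok())
        .sum();

    println!("converted {} bytes from {} files", total, paths.len());
    Ok(())
}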


Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

Posted by Josh Taylor <jo...@gmail.com>.
Hi!

I've been using arrow/arrow-rs for a while now, my use case is to parse
Arrow streaming files and convert them into CSV.

Rust has been an absolutely fantastic tool for this, the performance is
outstanding and I have had no issues using it for my use case.

I would be happy to test out the branch and let you know what the
performance is like, since I was going to improve the current implementation
that I have for the CSV writer, as it takes a while for bigger datasets
(multi-GB).

Josh



Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

Posted by Jed Brown <je...@jedbrown.org>.
Andy Grove <an...@gmail.com> writes:
>
> Looking at this purely from the DataFusion/Ballista point of view, what I
> would be interested in would be having a branch of DF that uses arrow2 and
> once that branch has all tests passing and can run queries with performance
> that is at least as good as the original arrow crate, then cut over.
>
> However, for developers using the arrow APIs directly, I don't see an easy
> path. We either try and gradually PR the changes in (which seems really
> hard given that there are significant changes to APIs and internal data
> structures) or we port some portion of the existing tests over to arrow2
> > and then make that the official crate once all tests pass.

How feasible would it be to make a legacy module in arrow2 that would enable (some large subset of) existing arrow users to try arrow2 after adjusting their use statements? (That is, implement the public-facing legacy interfaces in terms of arrow2's new, safe interface.) This would make it easier to test with DataFusion/Ballista and external users of the current arrow crate, then cut over and let those packages update incrementally from legacy to modern arrow2.

I think it would be okay to tolerate some performance degradation when working through these legacy interfaces, so long as there was confidence that modernizing the callers would recover the performance (as tests have been showing).
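
To illustrate what Jed describes, such a shim could look something like
this (hypothetical names throughout — neither crate defines a `legacy`
module today):

// A hypothetical `arrow2::legacy` module: old-style entry points
// implemented on top of the new typed API, so existing callers only
// have to change their `use` statements.
mod legacy {
    /// Simplified stand-in for arrow2's typed buffer.
    pub struct Buffer<T: Copy> {
        data: Vec<T>,
    }

    /// Old-style constructor, re-expressed safely: the element type is
    /// fixed by the caller's slice, so the bytes-reinterpretation
    /// foot-gun of the old API cannot be reproduced.
    pub fn buffer_from_slice<T: Copy>(values: &[T]) -> Buffer<T> {
        Buffer { data: values.to_vec() }
    }

    impl<T: Copy> Buffer<T> {
        /// Old-style accessor, now bounds-checked at the API boundary.
        pub fn value(&self, i: usize) -> Option<T> {
            self.data.get(i).copied()
        }
    }
}

fn main() {
    // A legacy caller swaps `use arrow::...` for `use arrow2::legacy::...`
    // and keeps the rest of its code unchanged.
    let buffer = legacy::buffer_from_slice(&[0i32, 2i32]);
    assert_eq!(buffer.value(1), Some(2));
}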

Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

Posted by Andy Grove <an...@gmail.com>.
I don't have a very strong opinion on new repo vs branch but having a new
repo seems simpler and less overhead to me. I think it makes this effort
far more visible to the community and is more likely to get more people
involved.

Looking at this purely from the DataFusion/Ballista point of view, what I
would be interested in would be having a branch of DF that uses arrow2 and
once that branch has all tests passing and can run queries with performance
that is at least as good as the original arrow crate, then cut over.

However, for developers using the arrow APIs directly, I don't see an easy
path. We either try and gradually PR the changes in (which seems really
hard given that there are significant changes to APIs and internal data
structures) or we port some portion of the existing tests over to arrow2
and then make that the official crate once all tests pass.

If I were doing the work, I think I would try and verify the new crate and
cut over rather than try and gradually PR the changes in, assuming that the
community is behind this approach and sees the value of the new crate.

Andy.

Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

Posted by Wes McKinney <we...@gmail.com>.
I think given the size and scope of the work, there's a stronger
argument for having an IP clearance for this code (as compared with
python-datafusion).


Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

Posted by Andrew Lamb <al...@influxdata.com>.
I am not opposed to a new repo.

However I believe that the largest barrier to the community really getting
their heads around / evaluating arrow2 is its sheer size. -92k +57k isn't
something I am likely to get my head around in any level of detail until I
actively work with it for a while.

The best way to get community input, I think, is to start the process of
getting arrow2 into arrow-rs via PRs. While splitting it up into multiple
PRs somehow is likely not easy and would require lots more work, starting
to get this work into arrow-rs in smaller chunks would be the ideal outcome
in my opinion.

Therefore, I don't see any benefit to a new repo -- I think a branch in
arrow-rs (or a fork) would work just as well. But again, I am not opposed
to a new repo either.

Andrew


Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

Posted by Fernando Herrera <fe...@gmail.com>.
Thanks Jorge for the update and for the continued development of a
safer version of arrow.

I would like to give my support to option 3 as well. IMHO it will give
arrow2 the exposure it needs to be considered by a wider set of
users. This exposure will open the possibility of more
participation on the missing features required to integrate
arrow2 into DataFusion and Ballista.

It will also give arrow users peace of mind that arrow2 will
follow the Apache way, meaning that its development will be stable,
supported and community driven. I have dealt with this issue myself:
I have suggested the use of arrow2 for new projects, only
to have it dismissed because of the impression that it isn't supported
by the Apache community, even after the advantages of the
project were clear.


Fernando

Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.
Hi,

I would like to offer an update on this. I continued to work heavily on
this hypothesis in separate repos [1, 2], as it required a ground-up
refactor.

Below is the current status and, at the end, some options that we can
consider.

TL;DR: I would like to gauge the community's interest in making arrow2 and
parquet2 experimental repos in Apache Arrow. IMO they are safer, faster,
more maintainable, and equally compatible with both the arrow and parquet
specs.

# Specification and interoperability

IPC:

All integration tests that run in 4.1.0 pass against apache/arrow@master.
Furthermore, arrow2 passes the following tests that 4.1.0 does not:
* big endian IPC producer and consumer
* decimal128
* interval
* nested_large_offsets

FFI: All integration tests that run in 4.1.0 pass against pyarrow==4.0.

Arrow-Flight: it is the same code; I am not sure whether the tests in 4.1.0
pass or are skipped.

parquet: arrow2 is tested against parquet files generated by pyarrow under
different configurations:
* physical and logical types
* page versions
* repetition levels
* dictionary encoding

# Safety

* arrow2 addresses all known security vulnerabilities (all 20+ of our issues
labeled with "security" [3], plus more not yet captured in any issue) and
unsafety issues. In particular:
* all tests pass under MIRI checks
* all unsafe APIs are marked as unsafe (see the sketch after this list)
* parquet2 does not use unsafe
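
To make the convention concrete, here is a minimal sketch of the pattern
(illustrative names, not the actual arrow2 signatures): safe accessors
validate bounds, and any accessor that skips the check is itself marked
unsafe, so the compiler forces callers to acknowledge the contract:

struct PrimitiveArray<T> {
    values: Vec<T>,
}

impl<T: Copy> PrimitiveArray<T> {
    /// Safe: panics on an out-of-bounds index instead of reading
    /// invalid memory.
    fn value(&self, i: usize) -> T {
        self.values[i]
    }

    /// Safety: the caller must guarantee that `i < self.values.len()`.
    unsafe fn value_unchecked(&self, i: usize) -> T {
        *self.values.get_unchecked(i)
    }
}

fn main() {
    let a = PrimitiveArray { values: vec![1.0f64, 2.0] };
    println!("{}", a.value(1));                      // bounds-checked
    println!("{}", unsafe { a.value_unchecked(1) }); // caller asserts the invariant
}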

# Maintenance

* arrow + parquet total 56k+36k LOC, excluding headers
* arrow2 + parquet2 total 50k+7k LOC, excluding headers
* arrow2 coverage is 76%, arrow is 88%
* parquet2 is "unsafe"-free

# Features

Outside the spec (e.g. compute, utils, parquet), the crate has about 90% of
the features in 4.0. What is missing:
* nested parquet read and write (lists and structs)
* some features added since 4.0.0

OTOH, it has the following additional features:

* API to read CSV in parallel
* API to read parquet in parallel
* checked_X, saturating_X, overflowing_X operations, i.e. non-panicking
versions of add, subtract, etc. (see the sketch after this list)
* arithmetic ops over dates, timestamps and durations
* display for every logical type
* more cast operations
* merge-sort kernel
* vectorized hashing
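
As a concrete illustration of the three arithmetic variants, here is a
minimal sketch in plain std Rust (the actual arrow2 kernels operate on
arrays with validity bitmaps, so their signatures differ):

fn checked_add(a: &[i32], b: &[i32]) -> Vec<Option<i32>> {
    // None on overflow instead of panicking
    a.iter().zip(b).map(|(x, y)| x.checked_add(*y)).collect()
}

fn saturating_add(a: &[i32], b: &[i32]) -> Vec<i32> {
    // clamps to i32::MAX / i32::MIN on overflow
    a.iter().zip(b).map(|(x, y)| x.saturating_add(*y)).collect()
}

fn overflowing_add(a: &[i32], b: &[i32]) -> Vec<(i32, bool)> {
    // wrapped result plus a flag indicating whether overflow happened
    a.iter().zip(b).map(|(x, y)| x.overflowing_add(*y)).collect()
}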

# Performance

* 3-15x faster reading and writing of parquet, plus APIs to read it in
parallel (see [4])
* faster IPC read
* arithmetic and the like perform about the same as arrow 4.0 compiled with
SIMD (available on nightly only), and ~1.5x faster than arrow 4.0 without
SIMD and nightly
* Some kernels degrade by about 20% due to bounds checks (e.g. boolean
"take" in arrow 4.1 allows out-of-bounds reads and is thus faster); see the
sketch after this list
* Sort and filter are ~2x faster; see [5, 6]
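
To make the bounds-check trade-off concrete, here is a toy sketch (not the
actual kernel code) of a checked gather next to the unchecked variant that
a kernel allowing out-of-bounds reads effectively amounts to:

fn take<T: Copy>(values: &[T], indices: &[usize]) -> Vec<T> {
    // `values[i]` panics on an out-of-bounds index; this check is the
    // ~20% cost mentioned above.
    indices.iter().map(|&i| values[i]).collect()
}

/// Safety: the caller must guarantee that every index is < values.len().
unsafe fn take_unchecked<T: Copy>(values: &[T], indices: &[usize]) -> Vec<T> {
    indices.iter().map(|&i| *values.get_unchecked(i)).collect()
}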

# Interoperability with DataFusion

I have an experimental PR [7] in DataFusion. TPC-H query 1 yields the same
result and has about the same performance (DataFusion can perform
out-of-bounds reads from arrow...); without group-bys, it is ~2x faster.

# Process / Community

This is a "big chunk of code" type of situation, developed in an external
repo. I tried to keep folks informed of the status and of what was being
done; the mono-repo at the time was really difficult to cope with. With
this said:

* I proposed an experimental repo mechanism so that we can conduct this
type of activities (now merged [8])
* I am not merging PRs that introduce major new APIs, so that the community
can weigh in

# Licensing

All code is licensed under both MIT and Apache; contributors are required to
accept either of these as part of their contributions. I can work with the
incubator / ASF to handle this part ahead of any potential code movement.

-----------------

Main question: what do we do?

Some ideas:

1. PR all this to the arrow-rs repo
2. push this to a branch in arrow-rs
3. move this to an experimental repo within ASF and work on it until we
have feature parity (e.g. read and write nested types to/from parquet), and
then apply 1 or 2
4. do nothing

Concerns with option 1:
* development will continue to happen outside ASF
* no easy way to collaborate: issue tracking for this code outside ASF
* no easy way for the community to weigh in on changes to the API prior
to merge
Concerns with option 2:
* issue tracking on branches is confusing, especially when creating change
logs
* PRs to branches is confusing
* no easy way for the community to weigh in on changes to the API prior
to merge
Concerns with option 3:
* it would be the first experimental repo, thus some risk
Concerns with option 4:
* for the time being this would continue to be a project independent of
Apache Arrow
* I would release 0.1 to crates.io as "arrow-safe" or something, as there is
demand for it.

I would be in favor of option 3 for the following reason: I do not think it
is useful to PR a -92k +57k change without giving the community extensive
time to evaluate it, contribute to it, and discuss it properly. In this
context, my idea here is to give the ideas some time to mature within ASF,
and only then even consider a switch.

Thanks,
Jorge

[1] https://github.com/jorgecarleitao/arrow2
[2] https://github.com/jorgecarleitao/parquet2
[3] https://github.com/apache/arrow-rs/issues?q=is%3Aopen+is%3Aissue+label%3Asecurity
[4] https://docs.google.com/spreadsheets/d/12Sj1kjhadT-l0KXirexQDOocsLg-M4Ao1jnqXstCpx0/edit#gid=1919295045
[5] https://jorgecarleitao.medium.com/safe-analytics-with-rust-and-arrow-564f05107dd2
[6] https://docs.google.com/spreadsheets/d/1hLKsqJaw_VLjtJCgQ635R9iHDNYwZscE0OT1omdZuwg/edit#gid=402497043
[7] https://github.com/apache/arrow-datafusion/pull/68
[8] https://issues.apache.org/jira/browse/ARROW-12643


Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

Posted by Fernando Herrera <fe...@gmail.com>.
Hi Jorge,

I tried running the code you pasted but it didn't compile. I get the
following error:

the trait `AsRef<[u8]>` is not implemented for `[i32; 2]`


I had to change it to this to compile (imports assumed for the arrow
crate of that time):

use std::sync::Arc;
use arrow::array::{ArrayData, Float64Array};
use arrow::buffer::Buffer;
use arrow::datatypes::DataType;

let buffer = Buffer::from(&[0u8, 2]);
let data = ArrayData::new(DataType::Int64, 10, None, None, 0,
    vec![buffer], vec![]);
let array = Float64Array::from(Arc::new(data));
println!("{:?}", array.value(1));


I didn't get an error out of it.

I do agree with you that there are several instances of unsafe in the code
that are not properly justified and that may lead to more problems in
the future.

Another thing that I have noticed is a pattern used in the API where a
struct has one constructor called new and another called
new_with_options; the second exists only for when you want to create the
struct with an options object. This could be simplified by using an enum
with all the possible options and an Option<&[OptionsEnum]> parameter for
the optional settings (a rough sketch follows).
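
A rough sketch of the idea (all names here are illustrative, not an
existing arrow API):

enum CsvOption {
    Delimiter(u8),
    HasHeader(bool),
    BatchSize(usize),
}

struct CsvReader {
    delimiter: u8,
    has_header: bool,
    batch_size: usize,
}

impl CsvReader {
    // A single constructor: callers pass None for all defaults, or any
    // subset of options in any order.
    fn new(options: Option<&[CsvOption]>) -> Self {
        let mut reader = CsvReader {
            delimiter: b',',
            has_header: true,
            batch_size: 1024,
        };
        if let Some(options) = options {
            for option in options {
                match option {
                    CsvOption::Delimiter(d) => reader.delimiter = *d,
                    CsvOption::HasHeader(h) => reader.has_header = *h,
                    CsvOption::BatchSize(n) => reader.batch_size = *n,
                }
            }
        }
        reader
    }
}

fn main() {
    let _defaults = CsvReader::new(None);
    let _custom = CsvReader::new(Some(&[CsvOption::Delimiter(b';')]));
}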

I think it does make sense to rethink the crate and to plan it better, to
avoid these issues and improve it further.

Regards,
Fernando
