Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2020/04/01 03:35:03 UTC

Re: The future of Parquet development for Arrow Rust?

At least for testing, would using the new C data interface for FFI from
Rust to C++ (where Rust code provides Arrow data and a file path to write
to) be an easy-to-use short-term solution?
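
For context, the C data interface that Micah refers to is specified in the Arrow format docs as a pair of plain C structs that producer and consumer share across the FFI boundary. A minimal Rust mirror of that ABI might look like the sketch below; the struct and field names follow the spec, but this is an illustration of the layout, not the actual FFI module of any Arrow crate.

```rust
use std::ffi::c_void;
use std::os::raw::c_char;

// Mirror of the Arrow C data interface ABI as defined in the Arrow
// format specification. Field order and types must match the C
// definitions exactly; #[repr(C)] guarantees the C struct layout.
#[repr(C)]
pub struct ArrowSchema {
    pub format: *const c_char,
    pub name: *const c_char,
    pub metadata: *const c_char,
    pub flags: i64,
    pub n_children: i64,
    pub children: *mut *mut ArrowSchema,
    pub dictionary: *mut ArrowSchema,
    // Release callback: the producer sets it, the consumer calls it once.
    pub release: Option<unsafe extern "C" fn(*mut ArrowSchema)>,
    pub private_data: *mut c_void,
}

#[repr(C)]
pub struct ArrowArray {
    pub length: i64,
    pub null_count: i64,
    pub offset: i64,
    pub n_buffers: i64,
    pub n_children: i64,
    pub buffers: *mut *const c_void,
    pub children: *mut *mut ArrowArray,
    pub dictionary: *mut ArrowArray,
    pub release: Option<unsafe extern "C" fn(*mut ArrowArray)>,
    pub private_data: *mut c_void,
}

fn main() {
    // Sanity-check the layout: ArrowSchema is 7 pointers + 2 i64,
    // ArrowArray is 5 i64 + 5 pointers, with no padding on common ABIs.
    let p = std::mem::size_of::<*const c_void>();
    assert_eq!(std::mem::size_of::<ArrowSchema>(), 7 * p + 16);
    assert_eq!(std::mem::size_of::<ArrowArray>(), 5 * p + 40);
    println!("C data interface struct layouts check out");
}
```

Because these structs carry raw buffer pointers plus a release callback, Rust code could hand an Arrow array to a C++ Parquet writer without copying, which is what makes this attractive as a stopgap for testing.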

On Tue, Mar 31, 2020 at 7:42 AM Andy Grove <an...@gmail.com> wrote:

> To get the ball rolling, here is a quick and dirty PR adding a test that
> writes an Arrow batch to a Parquet file.
>
> https://github.com/apache/arrow/pull/6785
>
> I'll keep iterating on this but will gladly accept help or hand this off to
> someone better qualified.
>
>
>
> On Tue, Mar 31, 2020 at 8:15 AM Wes McKinney <we...@gmail.com> wrote:
>
> > Here was the last discussion about this 6 months ago
> >
> > https://github.com/apache/parquet-testing/pull/9
> >
> > I saw another PR come through like this so that's why I'm bringing it up
> > again
> >
> > https://github.com/apache/parquet-testing/pull/11
> >
> > > On Tue, Mar 31, 2020 at 9:08 AM Andy Grove <an...@gmail.com> wrote:
> > >
> > > Hi Wes,
> > >
> > > I agree that this is important. I have been looking at the Parquet
> > > implementation this morning and I do see code for writing files, along
> > > with roundtrip tests. As you said, it isn't writing from Arrow types
> > > yet, but I would hope that this would be relatively simple to add. I
> > > don't know how complete the Parquet writer code is. It would be useful
> > > to get some guidance from the main authors of this crate.
> > >
> > > I'd be happy to create some JIRAs and try to help organize an effort
> > > here for the next release.
> > >
> > > Andy.
> > >
> > >
> > > On Mon, Mar 30, 2020 at 6:07 PM Wes McKinney <we...@gmail.com> wrote:
> > >
> > > > hi folks,
> > > >
> > > > More than a year has passed since the Parquet Rust project joined
> > > > forces with Apache Arrow.
> > > >
> > > > I raised this issue in the past, but the project still cannot write
> > > > files originating from Arrow records. In my opinion, this creates
> > > > sustainability / development scalability problems for the ongoing
> > > > development of the project. In particular, testing has to rely on
> > > > binary files either pre-generated or generated by another library.
> > > > This makes everything harder (testing, feature development,
> > > > benchmarking, and so forth) and increases the chance of failing to
> > > > cover edge cases.
> > > >
> > > > Looking back on over 4 years of C++ Parquet development, I doubt we
> > > > could have gotten the project to where it is now without a writer
> > > > implementation moving together with the reader. For example, we've
> > > > had to deal with issues arising in very large files (e.g. BinaryArray
> > > > overflows), and in many cases it would not be practical to store a
> > > > pre-generated file exhibiting some of these problems.
> > > >
> > > > Of course, in a volunteer-driven effort no one can be forced to
> > > > implement a writer, but since a good amount of time has passed I feel
> > > > I need to raise awareness of the issue again to see if an effort
> > > > might be mobilized, since this also impacts people who might come
> > > > to rely on this code in production. Given the importance of Parquet
> > > > in current
> > > > times, having a rock solid Parquet library will likely become
> > > > essential to sustained adoption of the Arrow Rust project (it has
> > > > certainly been very important for C++/Python/R adoption).
> > > >
> > > > best,
> > > > Wes
> > > >
> >
>