Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2020/03/31 00:06:22 UTC

The future of Parquet development for Arrow Rust?

hi folks,

More than a year has passed since the Parquet Rust project joined
forces with Apache Arrow.

I raised this issue in the past, but the project still cannot write
files originating from Arrow records. In my opinion, this creates
problems for the sustainability and scalability of the project's
ongoing development. In particular, testing has to rely on
binary files either pre-generated or generated by another library.
This makes everything harder (testing, feature development,
benchmarking, and so forth) and increases the chance of failing to
cover edge cases.

Looking back on over 4 years of C++ Parquet development, I doubt we
could have gotten the project to where it is now without a writer
implementation moving together with the reader. For example, we've had
to deal with issues arising in very large files (e.g. BinaryArray
overflows), and in many cases it would not be practical to store a
pre-generated file exhibiting some of these problems.
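For concreteness, the kind of writer-plus-reader round-trip test being asked
for might look like the sketch below. Nothing like this existed in the Rust
implementation when this thread was active; the ArrowWriter and
ParquetRecordBatchReaderBuilder names are taken from a recent release of the
parquet crate and stand in here as an illustration of the API shape, not as a
description of the code as it stood in 2020.

    // Sketch of an Arrow <-> Parquet round-trip test in Rust.
    // Written against recent arrow/parquet crates; illustrative only.
    use std::{env, fs::File, sync::Arc};

    use arrow::array::{Int32Array, StringArray};
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::record_batch::RecordBatch;
    use parquet::arrow::{arrow_reader::ParquetRecordBatchReaderBuilder, ArrowWriter};

    #[test]
    fn roundtrip_int32_and_utf8() -> Result<(), Box<dyn std::error::Error>> {
        // Build a small batch in memory instead of checking in a binary file.
        let schema = Arc::new(Schema::new(vec![
            Field::new("id", DataType::Int32, false),
            Field::new("name", DataType::Utf8, false),
        ]));
        let batch = RecordBatch::try_new(
            schema.clone(),
            vec![
                Arc::new(Int32Array::from(vec![1, 2, 3])),
                Arc::new(StringArray::from(vec!["a", "b", "c"])),
            ],
        )?;

        // Write the batch out to a temporary Parquet file...
        let path = env::temp_dir().join("roundtrip.parquet");
        let mut writer = ArrowWriter::try_new(File::create(&path)?, schema, None)?;
        writer.write(&batch)?;
        writer.close()?;

        // ...then read it back and compare with the original.
        let reader = ParquetRecordBatchReaderBuilder::try_new(File::open(&path)?)?.build()?;
        let read_back: Vec<RecordBatch> = reader.collect::<Result<_, _>>()?;
        assert_eq!(read_back, vec![batch]);
        Ok(())
    }

Because the data is generated inside the test, edge cases such as very large
binary columns can be synthesized on the fly rather than stored in a repository.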

Of course, since this is a volunteer-driven effort, no one can be forced
to implement a writer. But a good amount of time has passed, so I feel I
need to raise awareness of the issue again and see whether an effort can
be mobilized; it also affects people who might come to rely on this code
in production. Given the importance of Parquet these days, having a
rock-solid Parquet library will likely be essential to sustained
adoption of the Arrow Rust project (it has certainly been very important
for C++/Python/R adoption).

best,
Wes

Re: The future of Parquet development for Arrow Rust?

Posted by Micah Kornfield <em...@gmail.com>.
At least for testing, would using the new C data interface for FFI from
Rust to C++ (where the Rust code provides the Arrow data and a file path
to write to) be an easy-to-use short-term solution?
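This idea amounts to handing Arrow data across the C data interface and
letting Arrow C++ do the Parquet writing. A minimal sketch of what that
boundary could look like from the Rust side is below. The two structs follow
the published Arrow C data interface specification; write_parquet_via_cpp is
a purely hypothetical helper that would have to be written on the C++ side
(for example around arrow::ImportRecordBatch and parquet::arrow::WriteTable),
and the code that fills these structs from Rust arrays is omitted, since the
Rust crate had no such export path at the time.

    // FFI boundary sketch only: struct layouts per the Arrow C data interface
    // spec, plus a hypothetical C++ helper that does the actual writing.
    use std::os::raw::{c_char, c_void};

    #[repr(C)]
    pub struct ArrowSchema {
        pub format: *const c_char,
        pub name: *const c_char,
        pub metadata: *const c_char,
        pub flags: i64,
        pub n_children: i64,
        pub children: *mut *mut ArrowSchema,
        pub dictionary: *mut ArrowSchema,
        pub release: Option<unsafe extern "C" fn(*mut ArrowSchema)>,
        pub private_data: *mut c_void,
    }

    #[repr(C)]
    pub struct ArrowArray {
        pub length: i64,
        pub null_count: i64,
        pub offset: i64,
        pub n_buffers: i64,
        pub n_children: i64,
        pub buffers: *mut *const c_void,
        pub children: *mut *mut ArrowArray,
        pub dictionary: *mut ArrowArray,
        pub release: Option<unsafe extern "C" fn(*mut ArrowArray)>,
        pub private_data: *mut c_void,
    }

    extern "C" {
        // Hypothetical helper compiled against Arrow C++/Parquet: Rust hands it
        // an exported schema/array pair plus an output path and the C++ side
        // performs the write. This function does not exist in the Arrow codebase.
        fn write_parquet_via_cpp(
            schema: *mut ArrowSchema,
            array: *mut ArrowArray,
            path: *const c_char,
        ) -> i32;
    }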


Re: The future of Parquet development for Arrow Rust?

Posted by Andy Grove <an...@gmail.com>.
To get the ball rolling, here is a quick and dirty PR adding a test that
writes an Arrow batch to a Parquet file.

https://github.com/apache/arrow/pull/6785

I'll keep iterating on this but will gladly accept help or hand this off to
someone better qualified.

Re: The future of Parquet development for Arrow Rust?

Posted by Wes McKinney <we...@gmail.com>.
Here was the last discussion about this, from 6 months ago:

https://github.com/apache/parquet-testing/pull/9

I saw another PR like this come through, which is why I'm bringing it up again:

https://github.com/apache/parquet-testing/pull/11


Re: The future of Parquet development for Arrow Rust?

Posted by Andy Grove <an...@gmail.com>.
Hi Wes,

I agree that this is important. I have been looking at the Parquet
implementation this morning and I do see code for writing files, along
with roundtrip tests. As you said, it isn't writing from Arrow types yet,
but I would hope that this would be relatively simple to add. I don't know
how complete the Parquet writer code is. It would be useful to get some
guidance from the main authors of this crate.
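For context, the existing (non-Arrow) write path described here works at the
level of Parquet message schemas, row groups, and typed column writers. The
sketch below is written against a recent release of the parquet crate, so the
exact names differ from the 2020-era API, but the overall shape is the same;
it is meant as an illustration of that lower-level path, not as a description
of the code as it stood at the time.

    // Low-level Parquet write sketch in Rust: no Arrow types involved.
    use std::{env, fs::File, sync::Arc};

    use parquet::data_type::Int32Type;
    use parquet::file::properties::WriterProperties;
    use parquet::file::writer::SerializedFileWriter;
    use parquet::schema::parser::parse_message_type;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Parquet schema, defined independently of any Arrow schema.
        let schema = Arc::new(parse_message_type("message demo { required int32 id; }")?);
        let props = Arc::new(WriterProperties::builder().build());

        let path = env::temp_dir().join("low_level.parquet");
        let mut writer = SerializedFileWriter::new(File::create(&path)?, schema, props)?;

        // Values are pushed column by column within a row group,
        // rather than batch by batch as an Arrow-aware writer would do.
        let mut row_group = writer.next_row_group()?;
        while let Some(mut column) = row_group.next_column()? {
            column
                .typed::<Int32Type>()
                .write_batch(&[1, 2, 3], None, None)?;
            column.close()?;
        }
        row_group.close()?;
        writer.close()?;
        Ok(())
    }

An Arrow-aware writer would sit on top of this layer, converting each Arrow
column into the appropriate typed column writes.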

I'd be happy to create some JIRAs and try and help organize an effort here
for the next release.

Andy.