Posted to dev@arrow.apache.org by Jeszy <je...@gmail.com> on 2022/05/18 13:54:47 UTC

ARROW-11465

Hello,

I wanted to circle back to this topic and make sure there's a decision
by the community. Although there was sporadic discussion on Jira[1],
the PR[2], and this list[3] in the past, the messaging across these
channels has changed over time. For example, while the PR comment is
negative, the much more recent Jira comment suggests a way forward. Is
this feature likely to gain consensus if I follow those comments? I'd
like to avoid wasted effort.

There is an upside not explicitly mentioned before. Appending to a
multi-file Parquet dataset is already possible, as Weston mentioned,
but being able to append row groups and rewrite the footer would also
allow the file's (dataset's) overall metadata to be maintained as new
data is added.

Thanks,
Balazs

---
[1] https://issues.apache.org/jira/browse/ARROW-11465
[2] https://github.com/apache/arrow/pull/8130
[3] https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw
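
To make "append row groups and rewrite the footer" concrete, here is a
minimal byte-level sketch. It relies only on the Parquet file layout:
a leading "PAR1" magic, the data, the footer metadata, a 4-byte
little-endian footer length, and a trailing "PAR1". Everything else is
an assumption for illustration: the helper names (footer_span,
append_and_rewrite) are hypothetical, and the row group and footer
payloads are placeholder bytes rather than real Thrift-encoded
metadata. It is not the approach taken in the PR.

```python
import struct

MAGIC = b"PAR1"  # Parquet files start and end with this marker


def footer_span(data: bytes):
    """Return (start, end) byte offsets of the footer metadata."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic")
    # The 4 bytes before the trailing magic hold the footer length.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    end = len(data) - 8
    start = end - footer_len
    if start < 4:
        raise ValueError("footer length exceeds file size")
    return start, end


def append_and_rewrite(data: bytes, row_group: bytes, new_footer: bytes):
    """Drop the old footer, append a row group, write a new footer."""
    start, _ = footer_span(data)
    body = data[:start] + row_group
    return body + new_footer + struct.pack("<I", len(new_footer)) + MAGIC


# Placeholder payloads (assumption): real files hold encoded column
# chunks and Thrift-serialized FileMetaData here.
original = MAGIC + b"ROWGROUP0" + b"META0" + struct.pack("<I", 5) + MAGIC
updated = append_and_rewrite(original, b"ROWGROUP1", b"META0+1")
```

The new footer has to describe both the old and the new row groups,
which is why it is replaced wholesale rather than patched in place.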

Re: ARROW-11465

Posted by Jacques Nadeau <ja...@apache.org>.
I second Weston's comments. The idea of separate files is part of the de
jure spec but not the de facto one. It's up to the Parquet community
whether the de facto spec should be "altered". AFAIK, zero OSS readers
support use of this field.



On Wed, May 18, 2022, 8:53 AM Weston Pace <we...@gmail.com> wrote:

> I can try and clarify my earlier feedback:
>
> This is an Arrow datasets question if your goal is to create multiple
> independent parquet files, each one a complete file, and read them as
> a combined dataset.
> This is not an Arrow question (but instead a parquet question) if your
> goal is to create a single "parquet object" that is read as a single
> unit by a parquet reader, but spans multiple physical files.
>
> Your PR appears to be addressing the second case.  I think Micah's
> feedback is correct in this case.  You should bring this up on the
> parquet list and make sure there is some interest in adding this
> feature in other implementations first.  Otherwise there is a risk you
> will create these "parquet objects" that can only be read by a single
> implementation of the parquet reader.

Re: ARROW-11465

Posted by Weston Pace <we...@gmail.com>.
I can try and clarify my earlier feedback:

This is an Arrow datasets question if your goal is to create multiple
independent parquet files, each one a complete file, and read them as
a combined dataset.
This is not an Arrow question (but instead a parquet question) if your
goal is to create a single "parquet object" that is read as a single
unit by a parquet reader, but spans multiple physical files.

Your PR appears to be addressing the second case.  I think Micah's
feedback is correct in this case.  You should bring this up on the
parquet list and make sure there is some interest in adding this
feature in other implementations first.  Otherwise there is a risk you
will create these "parquet objects" that can only be read by a single
implementation of the parquet reader.
