Posted to dev@parquet.apache.org by Radu Teodorescu <ra...@yahoo.com.INVALID> on 2020/09/10 22:59:33 UTC

Implementing support for interpreting ColumnChunk.file_path in parquet file readers

Hello,
I am looking into enabling the C++ Arrow parquet reading API to read ColumnChunks that live in files other than the one containing the metadata, in line with the original parquet specification:

https://github.com/apache/parquet-format/blob/01971a532e20ff8e5eba9d440289bfb753f0cf0b/src/main/thrift/parquet.thrift#L769

and I would like to get feedback from the broader parquet community on the merits and potential pitfalls of supporting this feature.
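For reference, the relevant excerpt of the linked Thrift definition (abridged) reads:

    struct ColumnChunk {
      /** File where column data is stored. If not set, assumed to be
       * same file as metadata. This path is relative to the current file. */
      1: optional string file_path

      /** Byte offset in file_path to the ColumnMetaData */
      2: required i64 file_offset
      ...
    }

In other words, a reader that honors file_path must open a second file, located relative to the file whose footer it just read, to fetch the chunk's pages.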

On the merits front, support for this feature opens up a number of possibilities, many of which (I'm sure) already served as motivation for adding it to the format in the first place:

- Allows an entire data set (separately produced parquet files) to be handled seamlessly as a unified parquet file
- Allows derived parquet files (that, say, add or remove a column or a row group relative to an existing one) to be generated without copying the common data
- Allows a parquet file to be written and read concurrently, by producing intermediary metadata files that point to the row groups already written in full (see the sketch after this list)
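To make the reader-side contract concrete, here is a minimal sketch (in Python, using pyarrow only to inspect footers; a C++ implementation would apply the same rule) of how column-chunk locations resolve. Per the spec, file_path is interpreted relative to the file containing the metadata, and an unset file_path means the chunk lives in that same file. The helper name and layout are illustrative, not an existing API:

    import os
    import pyarrow.parquet as pq

    def chunk_locations(metadata_file):
        """Yield (row_group, column, path) for every column chunk,
        resolving ColumnChunk.file_path relative to the metadata file."""
        base_dir = os.path.dirname(os.path.abspath(metadata_file))
        md = pq.read_metadata(metadata_file)
        for rg in range(md.num_row_groups):
            row_group = md.row_group(rg)
            for col in range(row_group.num_columns):
                chunk = row_group.column(col)
                # Unset file_path: the pages are in the metadata file itself.
                target = chunk.file_path or os.path.abspath(metadata_file)
                yield rg, col, os.path.join(base_dir, target)

The concurrent write/read case is then just a matter of the writer periodically rewriting such a metadata file to cover one more fully written row group, while readers only ever open the metadata file and the finished row groups it points to.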

This functionality is already supported by the fastparquet implementation (https://github.com/dask/fastparquet/blob/0402257560e20b961a517ee6d770e0995e944163/fastparquet/api.py#L187), and I am happy to assist with a Java implementation if there is interest.
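For illustration, a hedged sketch of how this looks with fastparquet (the file names are made up; my understanding is that writer.merge builds a metadata-only _metadata file whose column chunks carry file_path entries pointing at the part files):

    import fastparquet
    from fastparquet import writer

    parts = ["dataset/part.0.parquet", "dataset/part.1.parquet"]
    # Combine the footers into dataset/_metadata; no page data is copied.
    writer.merge(parts)
    # Reading the metadata file pulls pages from the individual part files.
    pf = fastparquet.ParquetFile("dataset/_metadata")
    df = pf.to_pandas()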

I am looking forward to all the exciting insights.
Thank you
Radu

Re: Implementing support for interpreting ColumnChunk.file_path in parquet file readers

Posted by Micah Kornfield <em...@gmail.com>.
>
> Concurrent writing will also benefit wide Parquet schemas that have
> hundreds or thousands of columns.


The impression I got was that this use case is typically outside of what
parquet supports?


Re: Implementing support for interpreting ColumnChunk.file_path in parquet file readers

Posted by Deepak Majeti <ma...@gmail.com>.
+1 for this support.
Concurrent writing will also benefit wide Parquet schemas that have
hundreds or thousands of columns.


-- 
regards,
Deepak Majeti