Posted to dev@parquet.apache.org by Zhuo Jia Dai <zh...@gmail.com> on 2020/05/22 03:45:01 UTC

definition written before repetition?

I raised this issue: https://github.com/JuliaIO/Parquet.jl/issues/60

The official Parquet documentation states that repetition levels are
written before definition levels. However, the Julia Parquet package
reads definition levels before repetition levels, and the author insists
that he is right but has not provided further evidence.

I wanted to double-check this with the Parquet dev community. Is it true
that definition levels need to be written before repetition levels? If
so, the Parquet documentation is wrong, and I am happy to PR a fix.

Regards
-- 
ZJ

zhuojia.dai@gmail.com

Re: definition written before repetition?

Posted by Wes McKinney <we...@gmail.com>.
Sorry, I'm wrong -- C++ is doing it correctly, I was looking at the
wrong code. False alarm!

https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L685

I was shocked that such a blatant correctness issue might have existed,
but since people have been able to read nested data files with Spark
and other systems, everything is fine in C++.

On Fri, May 22, 2020 at 12:53 PM Wes McKinney <we...@gmail.com> wrote:
>
> If that's the case (and according to the Format documentation it is)
> then we are doing it incorrectly in C++. How depressing
>
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L1097
>
> This is unfortunately what happens when you don't have more rigorous
> integration tests.

Re: definition written before repetition?

Posted by Wes McKinney <we...@gmail.com>.
If that's the case (and according to the Format documentation it is)
then we are doing it incorrectly in C++. How depressing

https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L1097

This is unfortunately what happens when you don't have more rigorous
integration tests.


On Fri, May 22, 2020 at 3:14 AM Gabor Szadovszky <ga...@apache.org> wrote:
>
> Hi ZJ,
>
> parquet-mr clearly writes repetition levels and definition levels according
> to the specification. See the following code references.
> For V1 pages:
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterV1.java#L60
> For V2 pages:
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L655
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java#L221-L225
>
> Regards,
> Gabor

Re: definition written before repetition?

Posted by Gabor Szadovszky <ga...@apache.org>.
Hi ZJ,

parquet-mr clearly writes repetition levels and definition levels according
to the specification. See the following code references.
For V1 pages:
https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterV1.java#L60
For V2 pages:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L655
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java#L221-L225
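
For readers who want to see the ordering concretely, here is a minimal
Python sketch of the V1 data page layout as the format specification
describes it: repetition levels first, then definition levels, then the
encoded values. It assumes an uncompressed page body whose level blocks
are RLE-encoded, so each carries a 4-byte little-endian length prefix.
The function names are hypothetical and not taken from parquet-mr,
Arrow, or Parquet.jl, and the level payloads are treated as opaque
bytes rather than decoded.

```python
import struct


def build_v1_data_page(rep_levels, def_levels, values):
    """Assemble a V1 data page body (hypothetical helper, for illustration).

    rep_levels and def_levels are already-encoded byte strings, or None
    when the block is omitted (max level 0). Order: repetition levels,
    then definition levels, then values.
    """
    out = b""
    for block in (rep_levels, def_levels):
        if block is not None:
            # Assumption: RLE-encoded levels carry a 4-byte LE length prefix.
            out += struct.pack("<I", len(block)) + block
    return out + values


def split_v1_data_page(page, max_rep_level, max_def_level):
    """Split a V1 data page body into (rep_levels, def_levels, values)."""
    offset = 0

    def read_level_block():
        nonlocal offset
        (length,) = struct.unpack_from("<I", page, offset)
        start = offset + 4
        end = start + length
        if end > len(page):
            raise ValueError("level block runs past the end of the page")
        offset = end
        return page[start:end]

    # A block is absent when its maximum level is 0 (for example, no
    # repetition levels for a non-nested column).
    rep = read_level_block() if max_rep_level > 0 else b""
    dfn = read_level_block() if max_def_level > 0 else b""
    return rep, dfn, page[offset:]


# Placeholder payloads, not real RLE data.
page = build_v1_data_page(b"\x01\x02", b"\x03", b"vals")
rep, dfn, values = split_v1_data_page(page, max_rep_level=1, max_def_level=1)
```

A reader that pulled the definition levels first would interpret the
repetition block's length prefix and payload as definition levels, so
the swap only goes unnoticed on non-nested columns, where the
repetition block is absent.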

Regards,
Gabor


On Fri, May 22, 2020 at 6:35 AM Zhuo Jia Dai <zh...@gmail.com> wrote:

> I raised this issue: https://github.com/JuliaIO/Parquet.jl/issues/60
>
> The official Parquet documentation states that repetition levels are
> written before definition levels. However, the Julia Parquet package
> reads definition levels before repetition levels, and the author insists
> that he is right but has not provided further evidence.
>
> I wanted to double-check this with the Parquet dev community. Is it true
> that definition levels need to be written before repetition levels? If
> so, the Parquet documentation is wrong, and I am happy to PR a fix.
>
> Regards
> --
> ZJ
>
> zhuojia.dai@gmail.com
>