Posted to dev@parquet.apache.org by Zhuo Jia Dai <zh...@gmail.com> on 2020/05/22 03:45:01 UTC
definition written before repetition?
I raised this issue: https://github.com/JuliaIO/Parquet.jl/issues/60
The official Parquet documentation states that repetition levels are
written before definition levels. However, the Julia Parquet package
reads definition levels before repetition levels, and the author insists
that he is right but has not provided further evidence.
I wanted to double-check this with the Parquet dev community. Is it true
that definition levels need to be written before repetition levels? If
so, the Parquet documentation is wrong, and I am happy to submit a PR to fix it.
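To make the question concrete, here is a minimal sketch of the ordering as the documentation describes it for a V1 data page: repetition levels first, then definition levels, then the encoded values. It assumes the page body is already decompressed and that both level sections use the RLE/bit-packed hybrid encoding, where each section is prefixed by a 4-byte little-endian length. The function name and the test bytes are hypothetical, and the sketch only splits the sections; it does not decode them.

```python
import struct


def split_v1_data_page(page: bytes, max_rep_level: int, max_def_level: int):
    """Split a decompressed V1 data page body into (rep, def, values) byte sections.

    Assumption: levels are RLE-encoded, so each present section starts with
    a 4-byte little-endian byte length. A section is absent entirely when
    its maximum level is 0 (non-nested column / required column).
    """
    pos = 0

    def read_levels(max_level: int) -> bytes:
        nonlocal pos
        if max_level == 0:
            return b""  # nothing was written for this section
        (length,) = struct.unpack_from("<I", page, pos)
        pos += 4
        section = page[pos:pos + length]
        pos += length
        return section

    # Order matters: repetition levels come first, definition levels second.
    rep_levels = read_levels(max_rep_level)
    def_levels = read_levels(max_def_level)
    values = page[pos:]
    return rep_levels, def_levels, values
```

A reader that swaps the two `read_levels` calls would still parse flat columns correctly when only one section is present, which is one way such an ordering difference could go unnoticed until nested data is read.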
Regards
--
ZJ
zhuojia.dai@gmail.com
Re: definition written before repetition?
Posted by Wes McKinney <we...@gmail.com>.
Sorry, I'm wrong -- C++ is doing it correctly, I was looking at the
wrong code. False alarm!
https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L685
I was shocked that such a blatant correctness issue might have existed,
but since people have been able to read nested data files with Spark
and other systems, everything is fine in C++.
On Fri, May 22, 2020 at 12:53 PM Wes McKinney <we...@gmail.com> wrote:
>
> If that's the case (and according to the Format documentation it is)
> then we are doing it incorrectly in C++. How depressing
>
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L1097
>
> This is unfortunately what happens when you don't have more rigorous
> integration tests.
>
>
Re: definition written before repetition?
Posted by Wes McKinney <we...@gmail.com>.
If that's the case (and according to the Format documentation it is),
then we are doing it incorrectly in C++. How depressing.
https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L1097
This is unfortunately what happens when you don't have more rigorous
integration tests.
On Fri, May 22, 2020 at 3:14 AM Gabor Szadovszky <ga...@apache.org> wrote:
>
> Hi ZJ,
>
> parquet-mr clearly writes repetition levels and definition levels according
> to the specification. See the following code references.
> For V1 pages:
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterV1.java#L60
> For V2 pages:
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L655
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java#L221-L225
>
> Regards,
> Gabor
>
>
Re: definition written before repetition?
Posted by Gabor Szadovszky <ga...@apache.org>.
Hi ZJ,
parquet-mr clearly writes repetition levels and definition levels according
to the specification. See the following code references.
For V1 pages:
https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterV1.java#L60
For V2 pages:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L655
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java#L221-L225
Regards,
Gabor
On Fri, May 22, 2020 at 6:35 AM Zhuo Jia Dai <zh...@gmail.com> wrote:
> I raised this issue: https://github.com/JuliaIO/Parquet.jl/issues/60
>
> The official Parquet documentation states that repetition levels are
> written before definition levels. However, the Julia Parquet package
> reads definition levels before repetition levels, and the author insists
> that he is right but has not provided further evidence.
>
> I wanted to double-check this with the Parquet dev community. Is it true
> that definition levels need to be written before repetition levels? If
> so, the Parquet documentation is wrong, and I am happy to submit a PR to fix it.
>
> Regards
> --
> ZJ
>
> zhuojia.dai@gmail.com
>