Posted to dev@parquet.apache.org by Jozef Vilcek <jo...@gmail.com> on 2022/08/04 10:08:47 UTC

Fail to read back written large parquet file

I came across a case where a job writes out a data set in Parquet format
and it cannot be read back because it appears to be corrupted.

Files fail to read back once their size goes over 2GB. If I set the job
to produce more, smaller files from exactly the same input, everything is fine.

The job writes Avro messages to Parquet via `parquet-avro` and `parquet-mr`. It
happens with both v1.10.1 and v1.12.0.
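
For context, a minimal sketch of this kind of `parquet-avro` write path
(placeholder schema, path, and codec; my actual job differs) looks roughly like:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class WriteSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder schema; the real job uses a much wider Avro schema.
    Schema schema = SchemaBuilder.record("Example").fields()
        .requiredLong("id")
        .requiredString("payload")
        .endRecord();

    // Standard AvroParquetWriter setup from parquet-avro.
    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(new Path("/tmp/example.parquet"))
        .withSchema(schema)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build()) {
      GenericRecord record = new GenericData.Record(schema);
      record.put("id", 1L);
      record.put("payload", "value");
      writer.write(record);
    }
  }
}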

Read error is:

Cannot seek to negative offset
java.io.EOFException: Cannot seek to negative offset
at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1454)
at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:62)
at org.apache.parquet.hadoop.util.H2SeekableInputStream.seek(H2SeekableInputStream.java:60)
at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1157)
at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)

When digging into the read path, the code materializing
`ColumnChunkMetaData` [1] starts to see negative values for
`firstDataPage`. Printing some info from `reader.getRowGroups` yields the
dump below (a minimal sketch of such a probe follows it):


startingPos=4, totalBytesSize=519551822, rowCount=2300100
startingPos=108156606, totalBytesSize=517597985, rowCount=2300100
...
startingPos=1950017569, totalBytesSize=511705703, rowCount=2300100
startingPos=2058233752, totalBytesSize=521762439, rowCount=2300100
startingPos=-2128348908, totalBytesSize=508570588, rowCount=2300100
startingPos=-2020294298, totalBytesSize=518901187, rowCount=2300100
startingPos=-1911848035, totalBytesSize=512724804, rowCount=2300100
startingPos=-1803573306, totalBytesSize=510980877, rowCount=2300100
startingPos=-1695543557, totalBytesSize=525871692, rowCount=2300100
startingPos=-1587016600, totalBytesSize=519353830, rowCount=2300100
startingPos=-1478696427, totalBytesSize=451032173, rowCount=2090372
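
A minimal sketch of such a probe, using `ParquetFileReader` and `BlockMetaData`
(the file path is a placeholder), could look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class RowGroupProbe {
  public static void main(String[] args) throws Exception {
    // args[0] is the path to the suspect Parquet file.
    Path path = new Path(args[0]);
    Configuration conf = new Configuration();
    try (ParquetFileReader reader =
             ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
      // Each row group reports its starting offset, total byte size and row count.
      for (BlockMetaData block : reader.getRowGroups()) {
        System.out.println("startingPos=" + block.getStartingPos()
            + ", totalBytesSize=" + block.getTotalByteSize()
            + ", rowCount=" + block.getRowCount());
      }
    }
  }
}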



Unfortunately, I was not able to reproduce it locally by taking the Avro schema,
generating random inputs, and writing them out to a local file. Every time,
compressed or uncompressed, a 3GB file read back correctly.

I am looking for help in finding a solution, or hints for debugging this, as
I am out of clues for pinpointing and reproducing the problem.

Thanks!

[1]
https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ColumnChunkMetaData.java#L127

Re: Fail to read back written large parquet file

Posted by Chao Sun <su...@apache.org>.
Jozef, feel free to open a Parquet JIRA to give the issue more
details. Ideally the writer should recover by itself and produce the
correct result, but I don't have enough context yet, so I'm not sure
whether that's doable.


Re: Fail to read back written large parquet file

Posted by Jozef Vilcek <jo...@gmail.com>.
Found it. The problem in my case was not with Parquet but with my
implementation of the `OutputFile` wrapper providing the `PositionOutputStream`.

Would it make sense to change the writer to fail on negative offsets rather
than continue and produce unreadable results?
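
For illustration, a minimal sketch of such a wrapper (hypothetical class name,
not my actual code): the essential point is that the position is tracked as a
long, and arguably the stream should fail fast if the position ever goes negative:

import java.io.IOException;
import java.io.OutputStream;
import org.apache.parquet.io.PositionOutputStream;

// Hypothetical example, not my actual wrapper. The bug class: if `pos` were an
// int, it would wrap negative once the file grows past 2GB.
class CountingPositionOutputStream extends PositionOutputStream {
  private final OutputStream out;
  private long pos = 0; // must be a long, not an int

  CountingPositionOutputStream(OutputStream out) {
    this.out = out;
  }

  @Override
  public long getPos() throws IOException {
    if (pos < 0) {
      // Fail fast instead of letting the writer record negative offsets
      // in the footer metadata.
      throw new IOException("Negative stream position: " + pos);
    }
    return pos;
  }

  @Override
  public void write(int b) throws IOException {
    out.write(b);
    pos += 1;
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    out.write(b, off, len);
    pos += len;
  }

  @Override
  public void flush() throws IOException {
    out.flush();
  }

  @Override
  public void close() throws IOException {
    out.close();
  }
}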


Re: Fail to read back written large parquet file

Posted by Chao Sun <su...@apache.org>.
It seems the file was corrupted during the write. There's a similar issue we
found recently: https://issues.apache.org/jira/browse/PARQUET-2164.


Re: Fail to read back written large parquet file

Posted by Steve Loughran <st...@cloudera.com.INVALID>.
That has to be an integer wraparound: something is using a signed int for
the position, so when it goes above 2GB it goes negative, and a seek(negative
value) is rejected.

Fix: find the variable and make it a long.
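
For illustration only (this is not code from parquet-mr), the wraparound in
question:

public class WraparoundDemo {
  public static void main(String[] args) {
    // A signed 32-bit position wraps negative once it crosses 2GB.
    int intPos = Integer.MAX_VALUE; // 2147483647 bytes, just under 2GB
    intPos += 1;                    // overflows to -2147483648
    System.out.println(intPos);     // the "negative offset" that seek() rejects

    // Tracking the same position as a long stays correct.
    long longPos = Integer.MAX_VALUE;
    longPos += 1;                   // 2147483648
    System.out.println(longPos);
  }
}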


