Posted to dev@parquet.apache.org by Gil Vernik <GI...@il.ibm.com> on 2016/08/07 10:33:56 UTC
few issues with readAll(FSDataInputStream f) from ParquetFileReader
Hi all,
I keep Parquet objects stored in an OpenStack Swift-based object store
and then I read them with Spark.
From time to time I see the following exception:
java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at
org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:757)
After experimenting a bit, I believe I know the origin of this exception.
See the code at
https://raw.githubusercontent.com/Parquet/parquet-mr/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileReader.java:
public List<Chunk> readAll(FSDataInputStream f) throws IOException {
    List<Chunk> result = new ArrayList<Chunk>(chunks.size());
    f.seek(offset);
    byte[] chunksBytes = new byte[length];
    f.readFully(chunksBytes);
When offset is already near the end of the Parquet object, chunksBytes is
larger than the data actually left to read.
When this happens, readFully throws java.io.EOFException:
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
This seems to be normal behavior and even documented in DataInputStream:
* @exception EOFException if this input stream reaches the end
before reading all the bytes.
The question is: what behavior do you expect when
offset + chunksBytes.length > the actual file size?
I have cases where it is not possible to read Parquet objects because of
the current readAll(..) implementation: offset is near the end of the
object, f.seek(offset) positions the stream there, and then
f.readFully(chunksBytes) throws an EOFException because the chunksBytes
array is larger than the number of bytes left before the end of the object.
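To illustrate, here is a minimal JDK-only sketch (not parquet-mr code; the class and helper names are made up for this example) that reproduces the readFully behavior described above, using an in-memory stream instead of an object-store read:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

public class ReadFullyEofDemo {
    // Returns true if readFully throws EOFException when asked for
    // more bytes than the stream has left (the readAll(..) failure mode).
    static boolean readFullyThrowsEof(int available, int requested) throws IOException {
        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(new byte[available]));
        try {
            in.readFully(new byte[requested]);
            return false; // enough bytes were available
        } catch (EOFException e) {
            return true;  // stream ended before the buffer was filled
        }
    }

    public static void main(String[] args) throws IOException {
        // 10 bytes left in the stream, 20 requested: EOFException, as in readAll.
        System.out.println(readFullyThrowsEof(10, 20));
        // Exactly 10 bytes left, 10 requested: succeeds.
        System.out.println(readFullyThrowsEof(10, 10));
    }
}
```

This matches the DataInputStream Javadoc quoted below: EOFException is thrown whenever the stream ends before the full buffer is read.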
Thanks
Gil.
Re: few issues with readAll(FSDataInputStream f) from ParquetFileReader
Posted by Gil Vernik <GI...@il.ibm.com>.
In my case the issue is that
(offset + chunksBytes.length) > (actual file size),
and then an EOFException is thrown.
I don't think it's related to PARQUET-400. Do you feel it's related?
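For comparison, a hedged sketch of the kind of explicit bounds check that would surface this condition with a clearer error than a bare EOFException (hypothetical helper, not the actual parquet-mr code):

```java
import java.io.EOFException;

public class ChunkBoundsCheck {
    // Hypothetical guard: fail fast with a descriptive message instead of
    // letting readFully throw a bare EOFException mid-read.
    static void checkReadBounds(long offset, long length, long fileSize)
            throws EOFException {
        if (offset + length > fileSize) {
            throw new EOFException(
                "Requested range [" + offset + ", " + (offset + length)
                + ") exceeds file size " + fileSize);
        }
    }

    public static void main(String[] args) throws EOFException {
        checkReadBounds(0, 100, 200);   // within bounds: no exception
        checkReadBounds(150, 100, 200); // out of bounds: throws EOFException
    }
}
```

Whether the reader should fail fast like this, or clamp the read to the remaining bytes, is exactly the design question raised above.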
Thanks
Gil.
From: Piyush Narang <pn...@twitter.com.INVALID>
To: dev@parquet.apache.org
Date: 09/08/2016 20:37
Subject: Re: few issues with readAll(FSDataInputStream f) from
ParquetFileReader
Are you using the tip of head? There have been some issues with ByteBuffer
reads - https://issues.apache.org/jira/browse/PARQUET-400. The error there
seems different from what you're seeing though so it might be a different
issue: "can not read class org.apache.parquet.format.PageHeader: Required
field 'uncompressed_page_size' was not found in serialized data!
Struct:PageHeader(type:null,
uncompressed_page_size:0, compressed_page_size:0)"
Re: few issues with readAll(FSDataInputStream f) from ParquetFileReader
Posted by Piyush Narang <pn...@twitter.com.INVALID>.
Are you using the tip of head? There have been some issues with ByteBuffer
reads - https://issues.apache.org/jira/browse/PARQUET-400. The error there
seems different from what you're seeing though so it might be a different
issue: "can not read class org.apache.parquet.format.PageHeader: Required
field 'uncompressed_page_size' was not found in serialized data!
Struct:PageHeader(type:null,
uncompressed_page_size:0, compressed_page_size:0)"
--
- Piyush
Re: few issues with readAll(FSDataInputStream f) from ParquetFileReader
Posted by Gil Vernik <GI...@il.ibm.com>.
Any response?
Thanks
Gil.