Posted to dev@parquet.apache.org by Gil Vernik <GI...@il.ibm.com> on 2016/08/07 10:33:56 UTC

few issues with readAll(FSDataInputStream f) from ParquetFileReader

Hi all,

I keep Parquet objects in an OpenStack Swift based object store 
and read them with Spark.
From time to time I see the following exception:

java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at 
org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:757)

After experimenting a bit, I think I know the origin of this exception.

See the code from (
https://raw.githubusercontent.com/Parquet/parquet-mr/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileReader.java 
)
    public List<Chunk> readAll(FSDataInputStream f) throws IOException {
      List<Chunk> result = new ArrayList<Chunk>(chunks.size());
      f.seek(offset);
      byte[] chunksBytes = new byte[length];
      f.readFully(chunksBytes);

When offset is already near the end of the Parquet object, chunksBytes is 
larger than the actual data that is left to read.
When this happens, readFully throws a java.io.EOFException:

at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readFully(DataInputStream.java:169)


This seems to be normal behavior and even documented in DataInputStream:
     * @exception  EOFException  if this input stream reaches the end 
before reading all the bytes.
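
For completeness, this is easy to reproduce outside of Parquet. Here is a 
minimal standalone sketch (the class and variable names are mine, purely 
for illustration) that asks readFully for more bytes than the stream holds:

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

public class ReadFullyEofDemo {
  public static void main(String[] args) throws IOException {
    // Only 10 bytes are available, but we ask readFully for 16,
    // mimicking offset + length running past the end of the object.
    byte[] available = new byte[10];
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(available));
    byte[] buf = new byte[16];
    try {
      in.readFully(buf);
    } catch (EOFException e) {
      // Thrown exactly as the DataInputStream javadoc describes.
      System.out.println("EOFException before " + buf.length + " bytes were read");
    }
  }
}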

The question is: what behavior do you expect when 
offset + length(chunksBytes) > actual file size?

I have cases where it is not possible to read Parquet objects at all because 
of the current readAll(..) implementation: the offset is near the end of the 
object, f.seek(offset) positions the stream there, and then 
f.readFully(chunksBytes) throws EOF because the chunksBytes array is larger 
than the number of bytes left to read until the end of the object.
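
To show what I mean, here is a small diagnostic sketch. It is not code from 
parquet-mr; the path, offset and length parameters just stand in for the 
values ConsecutiveChunkList uses, so treat it as an illustration only. It 
compares the range the footer asks for with the length the file system 
reports:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChunkRangeCheck {
  // Warns when the requested chunk range does not fit inside the object,
  // which is exactly the case that makes readFully throw EOFException.
  public static void check(Path path, long offset, long length) throws IOException {
    FileSystem fs = path.getFileSystem(new Configuration());
    long fileLen = fs.getFileStatus(path).getLen();
    if (offset + length > fileLen) {
      System.err.printf(
          "chunk range [%d, %d) runs %d bytes past reported length %d%n",
          offset, offset + length, offset + length - fileLen, fileLen);
    }
  }
}

If that check fires, the footer metadata and the length reported by the 
store disagree, which would explain the EOF.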

Thanks
Gil.


Re: few issues with readAll(FSDataInputStream f) from ParquetFileReader

Posted by Gil Vernik <GI...@il.ibm.com>.
In my case the issue is that when 
(offset + length(chunksBytes)) > (actual file size),
an EOF is thrown.

I don't think it's related to PARQUET-400. Do you feel it's related?

Thanks
Gil.




From:   Piyush Narang <pn...@twitter.com.INVALID>
To:     dev@parquet.apache.org
Date:   09/08/2016 20:37
Subject:        Re: few issues with readAll(FSDataInputStream f) from 
ParquetFileReader



Are you using the tip of head? There have been some issues with ByteBuffer
reads - https://issues.apache.org/jira/browse/PARQUET-400. The error there
seems different from what you're seeing though so it might be a different
issue: "can not read class org.apache.parquet.format.PageHeader: Required
field 'uncompressed_page_size' was not found in serialized data!
Struct:PageHeader(type:null,
uncompressed_page_size:0, compressed_page_size:0)"


On Tue, Aug 9, 2016 at 1:27 AM, Gil Vernik <GI...@il.ibm.com> wrote:

> Any response?
>
> Thanks
> Gil.
>
>
>
>
> From:   Gil Vernik/Haifa/IBM@IBMIL
> To:     dev@parquet.apache.org
> Date:   07/08/2016 13:34
> Subject:        few issues with readAll(FSDataInputStream f) from
> ParquetFileReader
>
>
>
> Hi all,
>
> I keep Parquet objects in an OpenStack Swift based object store
> and read them with Spark.
> From time to time I see the following exception:
>
> java.io.EOFException
> at java.io.DataInputStream.readFully(DataInputStream.java:197)
> at java.io.DataInputStream.readFully(DataInputStream.java:169)
> at
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:757)
>
> After experimenting a bit, I think I know the origin of this exception.
>
> See the code from (
> https://raw.githubusercontent.com/Parquet/parquet-mr/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileReader.java
>
> )
>     public List<Chunk> readAll(FSDataInputStream f) throws IOException {
>       List<Chunk> result = new ArrayList<Chunk>(chunks.size());
>       f.seek(offset);
>       byte[] chunksBytes = new byte[length];
>       f.readFully(chunksBytes);
>
> When offset is already near the end of the Parquet object, chunksBytes is
> larger than the actual data that is left to read.
> When this happens, readFully throws a java.io.EOFException:
>
> at java.io.DataInputStream.readFully(DataInputStream.java:197)
> at java.io.DataInputStream.readFully(DataInputStream.java:169)
>
>
> This seems to be normal behavior and even documented in DataInputStream:
>      * @exception  EOFException  if this input stream reaches the end
> before reading all the bytes.
>
> The question is: what behavior do you expect when
> offset + length(chunksBytes) > actual file size?
>
> I have cases where it is not possible to read Parquet objects at all because
> of the current readAll(..) implementation: the offset is near the end of the
> object, f.seek(offset) positions the stream there, and then
> f.readFully(chunksBytes) throws EOF because the chunksBytes array is larger
> than the number of bytes left to read until the end of the object.
>
> Thanks
> Gil.
>
>
>
>
>
>


-- 
- Piyush





Re: few issues with readAll(FSDataInputStream f) from ParquetFileReader

Posted by Piyush Narang <pn...@twitter.com.INVALID>.
Are you using the tip of head? There have been some issues with ByteBuffer
reads - https://issues.apache.org/jira/browse/PARQUET-400. The error there
seems different from what you're seeing though so it might be a different
issue: "can not read class org.apache.parquet.format.PageHeader: Required
field 'uncompressed_page_size' was not found in serialized data!
Struct:PageHeader(type:null,
uncompressed_page_size:0, compressed_page_size:0)"


On Tue, Aug 9, 2016 at 1:27 AM, Gil Vernik <GI...@il.ibm.com> wrote:

> Any response?
>
> Thanks
> Gil.
>
>
>
>
> From:   Gil Vernik/Haifa/IBM@IBMIL
> To:     dev@parquet.apache.org
> Date:   07/08/2016 13:34
> Subject:        few issues with readAll(FSDataInputStream f) from
> ParquetFileReader
>
>
>
> Hi all,
>
> I keep Parquet objects in an OpenStack Swift based object store
> and read them with Spark.
> From time to time I see the following exception:
>
> java.io.EOFException
> at java.io.DataInputStream.readFully(DataInputStream.java:197)
> at java.io.DataInputStream.readFully(DataInputStream.java:169)
> at
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:757)
>
> After experimenting a bit, I think I know the origin of this exception.
>
> See the code from (
> https://raw.githubusercontent.com/Parquet/parquet-mr/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileReader.java
>
> )
>     public List<Chunk> readAll(FSDataInputStream f) throws IOException {
>       List<Chunk> result = new ArrayList<Chunk>(chunks.size());
>       f.seek(offset);
>       byte[] chunksBytes = new byte[length];
>       f.readFully(chunksBytes);
>
> When offset is already near the end of the Parquet object, chunksBytes is
> larger than the actual data that is left to read.
> When this happens, readFully throws a java.io.EOFException:
>
> at java.io.DataInputStream.readFully(DataInputStream.java:197)
> at java.io.DataInputStream.readFully(DataInputStream.java:169)
>
>
> This seems to be normal behavior and even documented in DataInputStream:
>      * @exception  EOFException  if this input stream reaches the end
> before reading all the bytes.
>
> The question is: what behavior do you expect when
> offset + length(chunksBytes) > actual file size?
>
> I have cases where it is not possible to read Parquet objects at all because
> of the current readAll(..) implementation: the offset is near the end of the
> object, f.seek(offset) positions the stream there, and then
> f.readFully(chunksBytes) throws EOF because the chunksBytes array is larger
> than the number of bytes left to read until the end of the object.
>
> Thanks
> Gil.
>
>
>
>
>
>


-- 
- Piyush

Re: few issues with readAll(FSDataInputStream f) from ParquetFileReader

Posted by Gil Vernik <GI...@il.ibm.com>.
Any response? 

Thanks
Gil.




From:   Gil Vernik/Haifa/IBM@IBMIL
To:     dev@parquet.apache.org
Date:   07/08/2016 13:34
Subject:        few issues with readAll(FSDataInputStream f) from 
ParquetFileReader



Hi all,

I keep Parquet objects in an OpenStack Swift based object store 
and read them with Spark.
From time to time I see the following exception:

java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at 
org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:757)

After experimenting a bit, I think I know the origin of this exception.

See the code from (
https://raw.githubusercontent.com/Parquet/parquet-mr/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileReader.java 

)
    public List<Chunk> readAll(FSDataInputStream f) throws IOException {
      List<Chunk> result = new ArrayList<Chunk>(chunks.size());
      f.seek(offset);
      byte[] chunksBytes = new byte[length];
      f.readFully(chunksBytes);

When offset is already near the end of the Parquet object, chunksBytes is 
larger than the actual data that is left to read.
When this happens, readFully throws a java.io.EOFException:

at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readFully(DataInputStream.java:169)


This seems to be normal behavior and even documented in DataInputStream:
     * @exception  EOFException  if this input stream reaches the end 
before reading all the bytes.

The question is: what behavior do you expect when 
offset + length(chunksBytes) > actual file size?

I have cases where it is not possible to read Parquet objects at all because 
of the current readAll(..) implementation: the offset is near the end of the 
object, f.seek(offset) positions the stream there, and then 
f.readFully(chunksBytes) throws EOF because the chunksBytes array is larger 
than the number of bytes left to read until the end of the object.

Thanks
Gil.