Posted to dev@orc.apache.org by Pavan Lanka <pl...@apple.com.INVALID> on 2021/09/30 16:41:11 UTC

EOFException when performing ORC Reads on AWS S3 using s3a://

Wanted to share this information in case anyone else runs into a similar problem.

Problem
——————————————
I was getting the following exception while an ORC read was taking place:
```text
Caused by: java.io.IOException: Problem opening stripe 0 footer in s3a://<snip>.
 at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:349)
 at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:878)
 at org.apache.iceberg.orc.OrcIterable.newOrcIterator(OrcIterable.java:125)
 ... 24 more
Caused by: java.io.EOFException: End of file reached before reading fully.
 at org.apache.hadoop.fs.s3a.S3AInputStream.readFully(S3AInputStream.java:702)
 at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111)
 at org.apache.orc.impl.RecordReaderUtils.readRanges(RecordReaderUtils.java:417)
 at org.apache.orc.impl.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:484)
 at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readFileData(RecordReaderUtils.java:102)
 at org.apache.orc.impl.reader.StripePlanner.readData(StripePlanner.java:177)
 at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1210)
 at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1250)
 at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1293)
 at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:344)
 ... 26 more
```

For context, the following versions were in use:
* Apache Spark 3.1.2
* Apache Iceberg 0.11.1
* Apache Hadoop 3.2.0
* Apache ORC 1.7.0
* The Iceberg tables were served out of AWS S3 using the S3AFileSystem from `hadoop-aws` (a minimal setup sketch follows this list)
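
For reference, a minimal sketch in Java (matching the stack traces above) of the kind of setup in play; the class name, bucket, and table path are illustrative placeholders, not the actual job:

```java
import org.apache.spark.sql.SparkSession;

// Minimal sketch of the setup involved. The bucket and table path are
// placeholders; credentials and catalog wiring are omitted.
public class OrcS3aReadSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("orc-s3a-read")
        // Route s3a:// URIs through hadoop-aws's S3AFileSystem.
        .config("spark.hadoop.fs.s3a.impl",
                "org.apache.hadoop.fs.s3a.S3AFileSystem")
        .getOrCreate();

    // Path-based Iceberg table load; the failing reads came from the ORC
    // data files underneath a table like this.
    spark.read().format("iceberg")
        .load("s3a://<bucket>/warehouse/db/table")
        .show();
  }
}
```

With this wiring, every `s3a://` read in the job goes through S3AInputStream, which is where the failure above originates.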

The failure was encountered while joining two ORC tables. It was reproducible and failed on the same set of files each time; however, a similar read of each individual table did not fail.

Solution
————————————
We first validated that the ORC read planning was correct by enhancing the exception message to include the offset and length of the requested range. Once this was confirmed, we started exploring the `hadoop-aws` artifact, which contains the S3AFileSystem and the S3AInputStream.
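
The actual change was a local patch to the reader, but the sketch below illustrates the idea: attach the planned offset and length to the failure so a short read can be compared against the file's real size. `RangeReadDebug` and `readFullyWithContext` are hypothetical names, not ORC APIs:

```java
import java.io.EOFException;
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

final class RangeReadDebug {
  // Wrap a positioned readFully so a failure surfaces the requested range.
  static void readFullyWithContext(FSDataInputStream in, long offset,
                                   byte[] buffer, int bufOffset, int length)
      throws IOException {
    try {
      in.readFully(offset, buffer, bufOffset, length);
    } catch (EOFException e) {
      // Re-throw with the planned range so a short read is easy to diagnose.
      EOFException wrapped = new EOFException(
          "EOF reading range [offset=" + offset + ", length=" + length + "]");
      wrapped.initCause(e);
      throw wrapped;
    }
  }
}
```

With the offset and length in hand, we could confirm that each planned range fell within the file's actual length, which pointed the blame away from ORC's read planning.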

We came across [HADOOP-16109][1], which documents the source of this issue well. After upgrading Apache Hadoop to 3.2.1, which includes the fix, the failure is no longer encountered.
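
If you apply the same upgrade, it may be worth confirming that the new jars are the ones actually loaded at runtime, since a stray older `hadoop-aws` or `hadoop-common` on the classpath would keep the bug alive. A small sketch (the class name is illustrative):

```java
import org.apache.hadoop.util.VersionInfo;
import org.apache.hadoop.fs.s3a.S3AInputStream;

public class HadoopVersionCheck {
  public static void main(String[] args) {
    // Version compiled into the hadoop-common jar on the classpath.
    System.out.println("Hadoop version: " + VersionInfo.getVersion());
    // Jar that S3AInputStream was actually loaded from, to spot conflicts.
    System.out.println("S3AInputStream loaded from: "
        + S3AInputStream.class.getProtectionDomain()
            .getCodeSource().getLocation());
  }
}
```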

Hope this helps.

Regards,
Pavan

[1]: https://issues.apache.org/jira/browse/HADOOP-16109



Re: EOFException when performing ORC Reads on AWS S3 using s3a://

Posted by Dongjoon Hyun <do...@gmail.com>.
Thank you for sharing, Pavan.

To be clear, this was a known general Hadoop issue.

Apache Spark 3.2.0 RC6 is using Hadoop 3.3.1.

Dongjoon.
