Posted to common-issues@hadoop.apache.org by "Steve Loughran (Jira)" <ji...@apache.org> on 2021/07/02 12:57:00 UTC

[jira] [Resolved] (HADOOP-17755) EOF reached error reading ORC file on S3A

     [ https://issues.apache.org/jira/browse/HADOOP-17755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran resolved HADOOP-17755.
-------------------------------------
    Resolution: Duplicate

> EOF reached error reading ORC file on S3A
> -----------------------------------------
>
>                 Key: HADOOP-17755
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17755
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 3.2.0
>         Environment: Hadoop 3.2.0
>            Reporter: Arghya Saha
>            Priority: Major
>
> Hi, I am trying to do some transformations using Spark 3.1.1 with Hadoop 3.2 on K8s, reading the data via s3a.
> I have around 700 GB of data to read and around 200 executors (5 vCores and 30 GB each).
> The problematic stage (Scan orc => Filter => Project) is able to read most of the files, but fails on a few files at the end with the error below. The file named in the error is around 140 MB, and all the other files are of similar size.
> I am able to read and rewrite the specific file mentioned (a sketch of that check follows the stack trace), which suggests the file is not corrupted.
> Let me know if further information is required.
>  
> {code:java}
> java.io.IOException: Error reading file: s3a://<bucket-with-prefix>/part-00001-5e22a873-82a5-4781-9eb9-473b483396bd.c000.zlib.orc
>     at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1331)
>     at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
>     at org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:96)
>     at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>     at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:511)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>     at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:177)
>     at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>     at org.apache.spark.scheduler.Task.run(Task.scala:131)
>     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.base/java.lang.Thread.run(Unknown Source)
> Caused by: java.io.EOFException: End of file reached before reading fully.
>     at org.apache.hadoop.fs.s3a.S3AInputStream.readFully(S3AInputStream.java:702)
>     at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111)
>     at org.apache.orc.impl.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:566)
>     at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readFileData(RecordReaderUtils.java:285)
>     at org.apache.orc.impl.RecordReaderImpl.readPartialDataStreams(RecordReaderImpl.java:1237)
>     at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1105)
>     at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1256)
>     at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1291)
>     at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1327)
>     ... 20 more
> {code}
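>  
> A minimal sketch of the read-and-rewrite check mentioned above (illustrative only, not the actual job: it assumes a running SparkSession, and the output prefix is a placeholder):
> {code:java}
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
>
> // Re-read the single ORC file named in the stack trace and write it back out.
> // If both steps succeed, the file can be read end to end outside the failing stage.
> SparkSession spark = SparkSession.builder().appName("orc-file-recheck").getOrCreate();
> Dataset<Row> df = spark.read().orc(
>     "s3a://<bucket-with-prefix>/part-00001-5e22a873-82a5-4781-9eb9-473b483396bd.c000.zlib.orc");
> System.out.println("rows read: " + df.count());
> // Placeholder output prefix, not a path from this job.
> df.write().mode("overwrite").orc("s3a://<bucket-with-prefix>/recheck/");
> {code}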
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org