Posted to common-dev@hadoop.apache.org by "Arghya Saha (Jira)" <ji...@apache.org> on 2021/06/09 11:54:00 UTC

[jira] [Created] (HADOOP-17755) EOF reached error reading ORC file on S3A

Arghya Saha created HADOOP-17755:
------------------------------------

             Summary: EOF reached error reading ORC file on S3A
                 Key: HADOOP-17755
                 URL: https://issues.apache.org/jira/browse/HADOOP-17755
             Project: Hadoop Common
          Issue Type: Bug
    Affects Versions: 3.2.0
         Environment: Hadoop 3.2.0
            Reporter: Arghya Saha


Hi, I am trying to run some transformations using Spark 3.1.1 / Hadoop 3.2 on K8s, reading and writing via s3a.

I have around 700 GB of data to read and around 200 executors (5 vCores and 30 GB memory each).

The problematic stage (Scan orc => Filter => Project) is able to read most of the files, but fails on a few files at the end with the error below.
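For context, the stage boils down to roughly the following (a minimal sketch in spark-shell style; the filter predicate, projected columns and output path are placeholders, not the actual job):

{code:scala}
// Minimal sketch of the failing stage (Scan orc => Filter => Project).
// Predicate, column names and output path are placeholders.
import spark.implicits._

val df = spark.read
  .orc("s3a://<bucket-with-prefix>/")       // ~700 GB of zlib-compressed ORC
  .filter($"some_column" === "some_value")  // Filter
  .select("col_a", "col_b")                 // Project

df.write.orc("s3a://<bucket-with-prefix>/output/")
{code}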

I am able to read and rewrite the specific file mentioned in the error, which suggests the file is not corrupted.
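Concretely, the check was roughly this (a sketch; the rewrite target is just an illustrative scratch path):

{code:scala}
// Reading the single file named in the stack trace and rewriting it
// completes without error, which is why I believe it is not corrupted.
// The output path below is only a scratch location.
val single = spark.read
  .orc("s3a://<bucket-with-prefix>/part-00001-5e22a873-82a5-4781-9eb9-473b483396bd.c000.zlib.orc")

single.write.mode("overwrite").orc("s3a://<bucket-with-prefix>/rewrite-check/")
{code}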

Let me know if further information is required.

{code:java}
java.io.IOException: Error reading file: s3a://<bucket-with-prefix>/part-00001-5e22a873-82a5-4781-9eb9-473b483396bd.c000.zlib.orc
    at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1331)
    at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
    at org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:96)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:511)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:177)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.io.EOFException: End of file reached before reading fully.
    at org.apache.hadoop.fs.s3a.S3AInputStream.readFully(S3AInputStream.java:702)
    at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111)
    at org.apache.orc.impl.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:566)
    at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readFileData(RecordReaderUtils.java:285)
    at org.apache.orc.impl.RecordReaderImpl.readPartialDataStreams(RecordReaderImpl.java:1237)
    at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1105)
    at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1256)
    at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1291)
    at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1327)
    ... 20 more
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-dev-help@hadoop.apache.org