Posted to user@spark.apache.org by Fernando Pereira <fe...@gmail.com> on 2018/01/15 10:32:02 UTC

End of Stream errors in shuffle

Hi,

I'm facing a very strange error that occurs halfway through long-running Spark
SQL jobs:

18/01/12 22:14:30 ERROR Utils: Aborting task
java.io.EOFException: reached end of stream after reading 0 bytes; 96 bytes expected
  at org.spark_project.guava.io.ByteStreams.readFully(ByteStreams.java:735)
  at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:127)
  at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:110)
  at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
  at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  (...)

Since I get this in several jobs, I wonder if it might be a problem in the
communication layer.
Has anyone faced a similar problem?

It always happens in a job that shuffles 200 GB, which is then read back in
partitions of ~64 MB for a groupBy (roughly as in the sketch below). And it is
weird that it only fails after processing over 1000 partitions (16 cores on one node).
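
For reference, here is a minimal sketch of the job shape (the path, column
names and aggregation are placeholders, not the real ones; only the ~200 GB
input, the groupBy and the ~64 MB partitions come from the actual job):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.collect_list

  val spark = SparkSession.builder().appName("groupBy-shuffle").getOrCreate()

  val df = spark.read.parquet("/path/to/input")        // placeholder path, ~200 GB of input

  val grouped = df
    .groupBy("key")                                    // the ~200 GB shuffle, read back in ~64 MB partitions
    .agg(collect_list("value").as("values"))           // placeholder aggregation

  grouped.write.parquet("/path/to/output")             // placeholder path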

I even tried changing the spark.shuffle.file.buffer config (set roughly as
shown below), but that only seems to change the point at which the error occurs.
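
For completeness, this is roughly how the setting gets passed in (the 1m value
below is only an example, not necessarily what I used; the default is 32k):

  import org.apache.spark.sql.SparkSession

  // Rough sketch of passing the shuffle buffer setting; 1m is only an example value.
  val spark = SparkSession.builder()
    .appName("shuffle-debug")
    .config("spark.shuffle.file.buffer", "1m")         // default is 32k
    .getOrCreate()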

I would really appreciate some hints on what this could be, what to try or
test, and how to debug it, as I feel pretty much blocked here.

Thanks in advance
Fernando

Re: End of Stream errors in shuffle

Posted by pratyush04 <pr...@gmail.com>.
Hi Fernando,

There is a 2 GB limit on shuffle blocks; since you say the job fails while
shuffling 200 GB of data, it might be due to this (a rough sketch of one
possible workaround follows the links).
These links give more detail about the limit:
http://apache-spark-developers-list.1001551.n3.nabble.com/Re-2GB-limit-for-partitions-td10435.html
https://issues.apache.org/jira/browse/SPARK-5928
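
If the 2 GB block limit is indeed the cause, one thing you could try (just a
sketch; 2000 and the "key" column are only examples, and spark/df stand for
your session and DataFrame) is to raise the number of shuffle partitions so
each shuffle block stays well below 2 GB:

  import org.apache.spark.sql.functions.{col, count}

  // 2000 is only an example; pick it so 200 GB / partitions stays well below 2 GB.
  spark.conf.set("spark.sql.shuffle.partitions", "2000")

  // Or repartition explicitly before the wide operation:
  val regrouped = df
    .repartition(2000, col("key"))
    .groupBy("key")
    .agg(count("*").as("n"))                           // placeholder aggregation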

Thanks,
Pratyush




