Posted to user@spark.apache.org by Ankur Srivastava <an...@gmail.com> on 2016/09/30 02:31:27 UTC

FetchFailed exception with Spark 1.6

Hi,

I am running a simple job on Spark 1.6 in which I am trying to leftOuterJoin a
big RDD with a smaller one. I am not broadcasting the smaller RDD yet, but I am
still running into FetchFailed errors, and the job eventually gets killed.
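
For reference, the join is essentially the shape below (a minimal,
self-contained sketch with toy data; the RDD names, key type, and values are
placeholders, not my actual code):

    import org.apache.spark.{SparkConf, SparkContext}

    object JoinSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("left-outer-join-sketch"))

        // Placeholder data; the real RDDs are loaded from storage and are much larger.
        val big   = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
        val small = sc.parallelize(Seq(("a", "x"), ("b", "y")))

        // leftOuterJoin shuffles both sides; the FetchFailed errors show up
        // during this shuffle phase.
        val joined = big.leftOuterJoin(small)
        joined.collect().foreach(println)

        sc.stop()
      }
    }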

I have already partitioned the data into 5000 partitions (see the sketch
below). Every time, the job runs with no errors for the first 2K to 3K tasks
and then starts hitting this exception.
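
By partitioning I mean roughly the following (again a sketch, building on the
placeholder RDDs above; 5000 is the actual partition count I use):

    import org.apache.spark.HashPartitioner

    // Use the same partitioner on both sides so the join runs over 5000 shuffle partitions.
    val partitioner = new HashPartitioner(5000)
    val joined = big.partitionBy(partitioner)
                    .leftOuterJoin(small.partitionBy(partitioner), partitioner)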

If I look further into the stack traces, for some of the failed tasks I see
errors like the ones below, but if there were a genuine network issue the
initial 2K+ tasks should not have succeeded either.

Caused by: java.io.IOException: Connection reset by peer


Caused by: java.io.IOException: Failed to connect to <host>

I am running on the YARN cluster manager with 200 executors and 6 GB of heap
for both the executors and the driver. In an earlier run I saw errors related
to spark.yarn.executor.memoryOverhead, so I have set it to 1.5 GB and no longer
see those errors.
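
For completeness, the submit command looks roughly like this (the class and
jar names are placeholders; memoryOverhead is given in MB, so 1536 is the
1.5 GB mentioned above):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --class com.example.JoinJob \
      --num-executors 200 \
      --executor-memory 6g \
      --driver-memory 6g \
      --conf spark.yarn.executor.memoryOverhead=1536 \
      join-job.jar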


Any help will be much appreciated.

Thanks
Ankur