Posted to user@spark.apache.org by Brett Meyer <Br...@crowdstrike.com> on 2014/11/21 18:47:36 UTC

Many retries for Python job

I'm running a Python script with spark-submit on top of YARN on an EMR
cluster with 30 nodes.  The script reads in approximately 3.9 TB of data
from S3, then does some transformations and filtering, followed by some
aggregate counts.  During Stage 2 of the job, everything appears to complete
just fine with no executor failures or resubmissions, but when Stage 3
starts up, many Stage 2 tasks have to be rerun due to FetchFailure errors.
In fact, I usually see at least 3-4 retries of Stage 2 before Stage 3 can
successfully start.  The whole application eventually completes, but the
retries add roughly an hour or more of overhead.
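
For reference, here is a rough sketch of the shape of the script (the paths,
field positions, and filter logic below are placeholders for illustration,
not the real ones):

    from pyspark import SparkContext

    sc = SparkContext(appName="example-job")

    # read the ~3.9 TB of input from S3 (placeholder bucket/prefix)
    raw = sc.textFile("s3n://my-bucket/input/*")

    # transformations and filtering (placeholder parsing/filter logic)
    parsed = raw.map(lambda line: line.split("\t"))
    filtered = parsed.filter(lambda fields: len(fields) > 3)

    # aggregate counts per key (placeholder key field)
    counts = filtered.map(lambda fields: (fields[0], 1)) \
                     .reduceByKey(lambda a, b: a + b)

    counts.saveAsTextFile("s3n://my-bucket/output/")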

I'm trying to determine why there were FetchFailure exceptions, since
anything computed in the job that cannot fit in the in-memory cache should,
by default, be spilled to disk and retrieved from there.  I also see some
"java.net.ConnectException: Connection refused" and
"java.io.IOException: sendMessageReliably failed without being ACK'd" errors
in the logs after a CancelledKeyException followed by a
ClosedChannelException, but I have no idea why the nodes in the EMR cluster
would suddenly stop being able to communicate.

If anyone has ideas as to why this data needs to be recomputed several times
in this job, please let me know, as I am fairly bewildered by this behavior.



Re: Many retries for Python job

Posted by Brett Meyer <Br...@crowdstrike.com>.
According to the web UI, I don't see any executors dying during Stage 2.  I
looked over the YARN logs and didn't see anything suspicious, but I may not
have been looking closely enough.  Stage 2 seems to complete just fine; it's
only when Stage 3 starts that the results from the previous stage turn out to
be missing in many cases, resulting in FetchFailure errors.  I should
probably also mention that I have spark.storage.memoryFraction set to
0.2.
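
For what it's worth, the submit command looks roughly like this (the executor
count, memory sizes, and script name below are approximations/placeholders,
not the exact values):

    spark-submit \
        --master yarn-client \
        --num-executors 30 \
        --executor-memory 8g \
        --executor-cores 4 \
        --conf spark.storage.memoryFraction=0.2 \
        my_script.py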

From:  Sandy Ryza <sa...@cloudera.com>
Date:  Friday, November 21, 2014 at 1:41 PM
To:  Brett Meyer <br...@crowdstrike.com>
Cc:  "user@spark.apache.org" <us...@spark.apache.org>
Subject:  Re: Many retries for Python job

Hi Brett, 

Are you noticing executors dying?  Are you able to check the YARN
NodeManager logs and see whether YARN is killing them for exceeding memory
limits?

-Sandy





Re: Many retries for Python job

Posted by Sandy Ryza <sa...@cloudera.com>.
Hi Brett,

Are you noticing executors dying?  Are you able to check the YARN
NodeManager logs and see whether YARN is killing them for exceeding memory
limits?
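
If log aggregation is enabled (yarn.log-aggregation-enable=true), something
like the following will pull the aggregated container logs once the
application finishes; the application id below is just a placeholder:

    yarn logs -applicationId application_1416500000000_0001 \
        | grep -i -B 1 -A 2 "beyond physical memory limits"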

-Sandy
