Posted to user@spark.apache.org by "Griffiths, Michael (NYC-RPM)" <Mi...@reprisemedia.com> on 2014/11/10 22:47:34 UTC

Spark Master crashes job on task failure

Hi,

I'm running Spark in standalone mode: 1 master, 15 slaves. I started the cluster with the spark-ec2 script, and I'm currently breaking the job into many small parts (~2,000) to better examine progress and failures.

Pretty basic - I'm submitting a PySpark job (via spark-submit) to the cluster. The job consists of loading files from S3, performing minor parsing, and storing the results in an RDD. The results are then written out to Hadoop with saveAsTextFile.
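For context, a minimal PySpark sketch of the kind of job described above (the bucket path, the parse_line function, and the output location are placeholders, not the actual job):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("s3-parse-job")
sc = SparkContext(conf=conf)

def parse_line(line):
    # Stand-in for the "minor parsing" step; replace with the real logic.
    fields = line.split("\t")
    return (fields[0], fields[1:])

# Load from S3, splitting the input into ~2,000 partitions so progress
# and failures are easier to track.
raw = sc.textFile("s3n://my-bucket/input/*", minPartitions=2000)
parsed = raw.map(parse_line)

# Write the results back out to the cluster's HDFS.
parsed.saveAsTextFile("hdfs:///output/parsed")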

Unfortunately, it keeps crashing. A small number of the tasks fail - timeout errors, I believe - and over half of the tasks that fail succeed when they are re-run. Still, a failing task shouldn't crash the entire job: it should just be retried up to four times and then given up on.

However, the entire job does crash. I was wondering why, and I believe that when a task is assigned to the master node and fails multiple times, it throws a SparkException and brings down the Spark master. If it were a slave, it would be OK - the slave could either re-register and continue, or not, but the entire job would carry on (to completion).

I've run the job a few times now, and the point at which it crashes depends on when one of the failing tasks gets assigned to the master.

The short-term solution would be to exclude the master from running tasks, but I don't see that option. Does it exist? Can I exclude the master from accepting tasks in Spark standalone mode?

The long-term solution, of course, is figuring out what part of the job (or which file in S3) is causing the error and fixing it. But right now I'd just like to get the first results back, knowing I'll be missing ~0.25% of the data.
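One way to get those first results while narrowing down the bad input is to keep the file path with each record and count parse failures instead of letting them kill the task. A sketch, assuming the files are small enough for wholeTextFiles and using a made-up parse step:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("s3-parse-tolerant")
sc = SparkContext(conf=conf)

# Counts records that fail to parse instead of letting them fail the task.
bad_records = sc.accumulator(0)

def parse_file(path_and_content):
    path, content = path_and_content
    out = []
    for line in content.splitlines():
        try:
            fields = line.split("\t")      # stand-in for the real parsing
            out.append((fields[0], fields[1:]))
        except Exception:
            bad_records.add(1)
            out.append(("BAD_RECORD", path))  # remember which file it came from
    return out

# wholeTextFiles yields (path, content) pairs, so failing files can be identified.
files = sc.wholeTextFiles("s3n://my-bucket/input/", minPartitions=2000)
parsed = files.flatMap(parse_file)
parsed.saveAsTextFile("hdfs:///output/parsed-tolerant")

print("records that failed to parse: %d" % bad_records.value)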

Thanks,
Michael

RE: Spark Master crashes job on task failure

Posted by "Griffiths, Michael (NYC-RPM)" <Mi...@reprisemedia.com>.
Never mind - I don't know what I was thinking with the message below. It's just maxTaskFailures causing the job to fail.
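For anyone hitting the same thing: the retry limit is the spark.task.maxFailures setting (maxTaskFailures internally, default 4); once a single task exceeds it, the whole job is aborted. Raising it is only a stopgap while the underlying errors get fixed, but for example:

from pyspark import SparkConf, SparkContext

# Let each task fail up to 10 times before the job is aborted
# (the default is 4). The value 10 is arbitrary.
conf = (SparkConf()
        .setAppName("s3-parse-job")
        .set("spark.task.maxFailures", "10"))
sc = SparkContext(conf=conf)

The same setting can be passed at submission time with spark-submit --conf spark.task.maxFailures=10.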

From: Griffiths, Michael (NYC-RPM) [mailto:Michael.Griffiths@reprisemedia.com]
Sent: Monday, November 10, 2014 4:48 PM
To: user@spark.apache.org
Subject: Spark Master crashes job on task failure
