Posted to user@spark.apache.org by Matthias Kricke <Ma...@mgm-tp.com> on 2014/07/16 09:21:05 UTC

How does Apache Spark handle system failure when deployed in YARN?

Hello @ the mailing list,

We are thinking of using Spark in one of our projects on a Hadoop cluster. During evaluation, several questions remain; they are stated below.

Preconditions
Let's assume Apache Spark is deployed on a Hadoop cluster using YARN. Furthermore, a Spark job is running. How does Spark handle the situations listed below?
Cases & Questions
1.     One node of the Hadoop cluster fails due to a disk error. However, replication is high enough and no data was lost.
*        What will happen to tasks that were running on that node?
2.     One node of the Hadoop cluster fails due to a disk error. Replication was not high enough and data was lost; Spark simply can no longer find a file that was pre-configured as a resource for the workflow.
*        How will it handle this situation?
3.     During execution the primary namenode fails over.
*        Does Spark automatically use the failover namenode?
*        What happens when the secondary namenode fails as well?
4.     For some reason the cluster is totally shut down during a workflow.
*        Will Spark restart with the cluster automatically?
*        Will it resume from the last "save" point of the workflow?

Thanks in advance. :)
Best regards
Matthias Kricke


Re: How does Apache Spark handle system failure when deployed in YARN?

Posted by Matthias Kricke <Ma...@mgm-tp.com>.
Thanks, your answers totally cover all my questions ☺


Re: How does Apache Spark handle system failure when deployed in YARN?

Posted by Sandy Ryza <sa...@cloudera.com>.
Hi Matthias,

Answers inline.

-Sandy


On Wed, Jul 16, 2014 at 12:21 AM, Matthias Kricke <Matthias.Kricke@mgm-tp.com> wrote:

>   Hello @ the mailing list,
>
> We are thinking of using Spark in one of our projects on a Hadoop
> cluster. During evaluation, several questions remain; they are stated
> below.
>
> *Preconditions*
>
> Let's assume Apache Spark is deployed on a Hadoop cluster using YARN.
> Furthermore, a Spark job is running. How does Spark handle the
> situations listed below?
>
> *Cases & Questions*
>
> 1.     One node of the Hadoop cluster fails due to a disk error.
> However, replication is high enough and no data was lost.
>
> •        *What will happen to tasks that were running on that node?*
>

Spark will rerun those tasks on a different node.
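
For reference, how many attempts a task gets before Spark gives up is
controlled by the spark.task.maxFailures setting (default 4). A minimal
sketch of raising it when building the context; the application name is
just a placeholder:

    import org.apache.spark.{SparkConf, SparkContext}

    // Give each task up to 8 attempts, so tasks lost with a failed node
    // are rerun elsewhere before the whole job is declared failed.
    val conf = new SparkConf()
      .setAppName("failure-tolerance-example")  // placeholder name
      .set("spark.task.maxFailures", "8")
    val sc = new SparkContext(conf)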


>  2.     One node of the Hadoop cluster fails due to a disk error.
> Replication was *not* high enough and data was lost; Spark simply can
> no longer find a file that was pre-configured as a resource for the
> workflow.
>
> •        *How will it handle this situation?*
>

After a number of failed task attempts trying to read the block, Spark
would pass up whatever error HDFS is returning and fail the job.
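
If you would rather fail fast, one option (a rough sketch, not the only
way) is to check up front that each pre-configured input still exists,
using the standard Hadoop FileSystem API; the path below is a
placeholder:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Verify a pre-configured input before starting the job, rather
    // than letting tasks fail deep into the workflow.
    val fs = FileSystem.get(new Configuration())
    val input = new Path("hdfs:///data/workflow/input.txt")
    if (!fs.exists(input)) {
      sys.error("Missing input " + input + "; data may have been lost")
    }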


>  3.     During execution the primary namenode fails over.
>
> •        *Does Spark automatically use the failover namenode?*
>
> •        *What happens when the secondary namenode fails as well?*
>

Spark accesses HDFS through the normal HDFS client APIs.  Under an HA
configuration, these will automatically fail over to the new namenode.  If
no namenodes are left, the Spark job will fail.
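
Note this assumes your paths reference the logical HA nameservice (the
dfs.nameservices entry in hdfs-site.xml) rather than a concrete
namenode host. A sketch, assuming a nameservice named "mycluster" and a
placeholder path:

    // The HDFS client resolves "mycluster" to whichever namenode is
    // currently active, so a failover is transparent to the Spark job.
    val lines = sc.textFile("hdfs://mycluster/data/events.log")
    println(lines.count())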


>  4.     For some reason the cluster is totally shut down during a
> workflow.
>
> •        *Will Spark restart with the cluster automatically?*
>
> •        *Will it resume from the last "save" point of the workflow?*
>
>
Can you elaborate a little more on what you mean by "the cluster is
totally shut down"?  Do you mean HDFS becomes inaccessible, or that all
the nodes in the cluster simultaneously lose power?  Spark has support
for checkpointing to HDFS, so you would be able to go back to the last
checkpoint that was taken while HDFS was available.
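
For the RDD case that looks roughly like the sketch below; the
checkpoint directory and input path are placeholders:

    // Checkpointing writes the RDD's data to reliable storage (HDFS
    // here) and truncates its lineage, so recovery can resume from the
    // saved copy instead of recomputing the workflow from the start.
    sc.setCheckpointDir("hdfs:///checkpoints/my-workflow")

    val derived = sc.textFile("hdfs:///data/input.txt")
      .map(_.toUpperCase)
      .cache()            // cache so the checkpoint run is not recomputed

    derived.checkpoint()  // must be called before the first action
    derived.count()       // triggers computation and the checkpoint write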

> Thanks in advance. :)
>
> Best regards
>
> Matthias Kricke