Posted to user@spark.apache.org by rama0120 <la...@gmail.com> on 2014/04/20 17:05:43 UTC

Spark recovery from bad nodes

Hi,

I am unable to see how Shark (eventually Spark) can recover from a bad
node in the cluster. One of my EC2 clusters with 50 nodes ended up with a
single node with datanode corruption. I see the following error when I try
to load a simple file into memory using CTAS:

org.apache.hadoop.hive.ql.metadata.HiveException (org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: All datanodes 10.196.4.2:9200 are bad. Aborting...)
    org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:602)
    shark.execution.FileSinkOperator$$anonfun$processPartition$1.apply(FileSinkOperator.scala:84)
    shark.execution.FileSinkOperator$$anonfun$processPartition$1.apply(FileSinkOperator.scala:81)
    scala.collection.Iterator$class.foreach(Iterator.scala:772)
    .....

I'd expect Spark to recover from this by blacklisting that node and using
the rest of the cluster to finish this task. However, it doesn't recover.
I saw this email chain from around a year ago:
https://groups.google.com/forum/#!topic/spark-users/maFIDL-0OoI
There it was acknowledged that Spark wouldn't recover from a partially
failed node. Is this still the situation? Has any progress been made here?
Even after I kill the application, start a brand new one, and retry the
job, it ends up in the same situation. I'd like to know how to handle this.

Thanks!
Lakshmi.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-recovery-from-bad-nodes-tp4505.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark recovery from bad nodes

Posted by Praveen R <pr...@sigmoidanalytics.com>.
Please check my comment on the shark-users thread:
https://groups.google.com/forum/#!searchin/shark-users/Failure$20recovery$20in$20Shark$20when$20cluster$20/shark-users/vUUGLZANxr8/MMCtKhqjhLMJ


On Tue, Apr 22, 2014 at 8:06 AM, rama0120 <la...@gmail.com> wrote:

> Hi,
>
> I couldn't find any details regarding this recovery mechanism - could
> someone please shed some light on this?

Re: Spark recovery from bad nodes

Posted by rama0120 <la...@gmail.com>.
Hi,

I couldn't find any details regarding this recovery mechanism - could
someone please shed some light on this? 



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-recovery-from-bad-nodes-tp4505p4576.html