Posted to user@spark.apache.org by kundan kumar <ii...@gmail.com> on 2015/10/28 07:30:43 UTC
org.apache.spark.shuffle.FetchFailedException: Failed to connect to ..... on worker failure
Hi,
I am running a Spark Streaming job. I was testing its fault tolerance by
killing one of the workers with kill -9.
My understanding is that when I kill a worker, the job should not die; it
should resume execution on the remaining workers.
Instead, I am getting the following error and the job halts:
org.apache.spark.shuffle.FetchFailedException: Failed to connect to .....
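
For reference, here is a minimal sketch of the kind of job I am running (the
actual source and logic differ; the socket source, host, and port below are
just placeholders). The reduceByKey step is what introduces the shuffle whose
fetch is failing:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingJob")
    val ssc = new StreamingContext(conf, Seconds(10))

    // reduceByKey forces a shuffle: reducers fetch map output from the
    // workers, which is where the FetchFailedException surfaces when the
    // worker holding that output has been killed.
    ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}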
However, when I restart the killed worker, or when a second worker on the
same machine survives (two workers were running on the machine and I killed
just one of them), execution resumes and the job completes.
Please help me understand why my job is not fault tolerant to a worker
failure. Am I missing something? Basically, I need the job to resume even if
a worker is lost.
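
For what it's worth, the recovery setup below is what I understood should let
the job survive failures (the checkpoint path is hypothetical; any
fault-tolerant store should work):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical checkpoint directory on a fault-tolerant filesystem.
val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("StreamingJob")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... build the DStream pipeline here ...
  ssc
}

// Restores the context from the checkpoint if one exists,
// otherwise creates a fresh one via createContext.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()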
Regards,
Kundan