Posted to user@spark.apache.org by kundan kumar <ii...@gmail.com> on 2015/10/28 07:30:43 UTC

org.apache.spark.shuffle.FetchFailedException: Failed to connect to ..... on worker failure

Hi,

I am running a Spark Streaming job. I was testing its fault tolerance by
killing one of the workers with kill -9.
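For context, my job is structured roughly like the sketch below. The master URL,
checkpoint path, and host names are placeholders rather than my real cluster values;
the point is only that the job uses reduceByKey, so it depends on shuffle data held
by the workers.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FaultToleranceTest {
  def main(args: Array[String]): Unit = {
    // Placeholder master URL and checkpoint path, not the actual values.
    val conf = new SparkConf()
      .setAppName("FaultToleranceTest")
      .setMaster("spark://master-host:7077")

    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")

    // A socket source feeding a word count; reduceByKey introduces a
    // shuffle, so a lost worker means shuffle blocks must be re-fetched.
    val lines = ssc.socketTextStream("source-host", 9999)
    lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}

The test itself is simply: while this job is running, log in to the worker machine
and kill -9 one of the two Worker JVMs running there.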

My understanding is that when a worker is killed, the job should not die; it
should recover and resume execution.

Instead, I get the following error and the job halts:

org.apache.spark.shuffle.FetchFailedException: Failed to connect to .....



Now, when I restart the same worker (two workers were running on the machine
and I killed just one of them), execution resumes and the job completes.

Please help me understand why my job is not fault tolerant to a worker
failure. Am I missing something? Basically, I need the job to resume even if a
worker is lost.



Regards,
Kundan