Posted to user@spark.apache.org by Aureliano Buendia <bu...@gmail.com> on 2014/01/23 01:04:55 UTC

Spark does not retry failed tasks initiated by hadoop

Hi,

I've written about this issue before, but there was no reply.

It seems that when a task fails due to Hadoop IO errors, Spark does not retry
that task; it only reports it as a failed task and carries on with the other
tasks. As an example:

WARN ClusterTaskSetManager: Loss was due to java.io.IOException
java.io.IOException: All datanodes x.x.x.x:50010 are bad. Aborting...
    at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
    at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2100(DFSClient.java:2589)
    at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2793)


I think almost all Spark applications need to have 0 failed tasks in order
to produce a meaningful result.

These IO errors are usually not repeatable, and they might not occur after
a retry. Is there a setting in Spark to enforce a retry upon such failed tasks?
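
A minimal sketch of the kind of setting I have in mind, assuming
spark.task.maxFailures is the relevant knob (0.9-style SparkConf shown; on
0.8.x the equivalent would presumably be setting the same property as a
system property before the SparkContext is created):

import org.apache.spark.{SparkConf, SparkContext}

// Assumption: spark.task.maxFailures bounds how many times a single task
// may fail before the whole job is aborted.
val conf = new SparkConf()
  .setAppName("retry-example")
  .set("spark.task.maxFailures", "8")  // allow more attempts than the default
val sc = new SparkContext(conf)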

Re: Spark does not retry failed tasks initiated by hadoop

Posted by Aureliano Buendia <bu...@gmail.com>.
On Thu, Jan 23, 2014 at 12:44 AM, Patrick Wendell <pw...@gmail.com> wrote:

> What makes you think it isn't retrying the task?


Because the output misses some rows.


> By default it tries
> three times... it only prints the error once, though. In this case, if
> your cluster doesn't have any datanodes, it's likely that it failed
> several times.
>

You are probably right. So in the missing rows case, Spark has tried 3
times and given up. Still, in the web UI there is no way to tell whether a
failed task was eventually executed correctly after a retry, or whether it
was given up on. I'm not sure if this can be detected in the log.
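
One thing that might help, sketched under the assumption that the
SparkListener callbacks expose the task-end reason roughly as below (I have
not checked the exact field names against the version in this thread):
register a listener and log every task-end event, so that a task that
failed and then succeeded on retry can be told apart from one that never
succeeded.

import org.apache.spark.Success
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Hypothetical listener: logs every task-end event so that retried-then-
// succeeded tasks can be distinguished from tasks that never succeeded.
class TaskRetryLogger extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val info = taskEnd.taskInfo
    taskEnd.reason match {
      case Success => println(s"task ${info.index} (TID ${info.taskId}) succeeded")
      case other   => println(s"task ${info.index} (TID ${info.taskId}) failed: $other")
    }
  }
}

// Usage: register before running the job, e.g. sc.addSparkListener(new TaskRetryLogger)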



Re: Spark does not retry failed tasks initiated by hadoop

Posted by Patrick Wendell <pw...@gmail.com>.
What makes you think it isn't retrying the task? By default it tries
three times... it only prints the error once, though. In this case, if
your cluster doesn't have any datanodes, it's likely that it failed
several times.
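
If you want to convince yourself of the retry behaviour, a small experiment
along these lines should work (a sketch only; the local[N,F] master form
with an explicit failure budget is meant for testing and may not exist in
older releases): make one task attempt throw, and check that the output is
still complete even though the error shows up in the log.

import java.util.concurrent.atomic.AtomicBoolean
import org.apache.spark.{SparkConf, SparkContext}

object FailOnce {
  // JVM-local flag: the first task attempt to reach it throws, later
  // attempts succeed.
  val flag = new AtomicBoolean(false)
}

object RetryExperiment {
  def main(args: Array[String]): Unit = {
    // local[2,4]: two worker threads, a budget of four task failures.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2,4]").setAppName("retry-test"))
    val result = sc.parallelize(1 to 10, 2).map { i =>
      if (FailOnce.flag.compareAndSet(false, true)) {
        throw new java.io.IOException("simulated transient IO error")
      }
      i * 2
    }.collect()
    println(result.mkString(","))  // all ten values, despite one failed attempt
    sc.stop()
  }
}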
