Posted to user@spark.apache.org by Stephen Haberman <st...@gmail.com> on 2013/11/11 22:04:30 UTC

task not getting finished

Hi,

We had a cluster hang a few days ago while running Spark master (standalone
mode, as an EMR job). The last output in the master log shows that Stage 5
was still running:

14:26 INFO scheduler.DAGScheduler: running: Set(Stage 5)

So, to see which task hadn't completed, I went looking for "Completed
ShuffleMapTask(5, ...)" entries. To find the one that was missing, I ran:

$ cat master.stderr | grep 'MapTask(5, ' | cut -d ' ' -f 7 | sort
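
For what it's worth, a small loop like the one below would have found the
gap directly instead of eyeballing the sorted list. It assumes the
completed lines really do look like "Completed ShuffleMapTask(5, 14)",
i.e. what the grep above is matching:

$ for n in $(seq 0 57); do
    # flag any task index with no "Completed ShuffleMapTask(5, n)" line
    grep -q "Completed ShuffleMapTask(5, $n)" master.stderr \
      || echo "task $n never completed"
  done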

Stage 5 had 58 tasks (0-57), and all of them completed except task 14, so I
assume that's what was hanging the job. Searching for task 14, I see it got
started:

14:15 cluster.ClusterTaskSetManager: Starting task 5.0:14 as TID 523 on
executor 8: ip-10-40-7-103.ec2.internal (PROCESS_LOCAL)

And it showed up on that slave:

14:15 executor.Executor: Running task ID 523

But a minute later the master ignored the slave's task update:

14:16 cluster.ClusterScheduler: Ignoring update from TID 523 because
its task set is gone

And after that, nothing. I don't know anything about the "task set gone"
scenario; what should have happened? Should the task have been retried?

I've uploaded the master log, slave log, and a slave jstack here:

https://gist.github.com/anonymous/ed6a57b0e3aff13d9f49

There are also a few suspicious-looking connection failures and "Could not
get block" errors on the slave a minute or so after it started running
TID 523, so I wouldn't be surprised if the task actually failed on the
slave. But whatever the result was, the master seems to have ignored it,
and so never retried the task.
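
(In case it helps anyone reading the gist: something like the grep below
should pull those errors out of the slave log, along with everything
mentioning the task; "slave.stderr" is just what I've called the file
locally.)

$ grep -n -i -E 'TID 523|task ID 523|Could not get block|connect' slave.stderr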

Are these logs enough for someone to piece together what happened? Any
hints where I could look? Is there something else I could grab off the
cluster next time it happens that would help?

Thanks!

- Stephen