Posted to user@spark.apache.org by Shannon Quinn <sq...@gatech.edu> on 2014/07/18 20:30:39 UTC

Job aborted due to stage failure: TID x failed for unknown reasons

Hi all,

I'm dealing with some strange error messages that I *think* come down 
to a memory issue, but I'm having a hard time pinning it down and could 
use some guidance from the experts.

I have a 2-machine Spark (1.0.1) cluster. Both machines have 8 cores; 
one has 16GB of memory and the other, which is the master, has 32GB. My 
application computes pairwise pixel affinities in images; the images 
I've tested so far range from 16x16 up to 1920x1200.
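
For reference, the job looks roughly like the sketch below. The broadcast 
variable IMAGE and the lambda are from the real code (they also show up in 
the traceback further down); the image loading, the names "indices" and 
"pairs", and the cartesian step are my simplification, not the exact code 
in affinity.py:

     import numpy as np
     from pyspark import SparkContext

     sc = SparkContext(appName="affinity")

     # Placeholder for the real image load: a flattened array of pixel values.
     image = np.random.rand(16 * 16)

     # Broadcast the image so every executor shares one read-only copy.
     IMAGE = sc.broadcast(image)

     # All (i, j) pixel index pairs, then the absolute intensity
     # difference for each pair.
     indices = sc.parallelize(range(len(image)))
     pairs = indices.cartesian(indices)
     affinities = pairs.map(
         lambda x: np.abs(IMAGE.value[x[0]] - IMAGE.value[x[1]])
     )

     # This collect() is the call that fails in the trace below.
     result = affinities.collect()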

I did have to change a few memory and parallelism settings, otherwise I 
was getting explicit out-of-memory errors (java.lang.OutOfMemoryError). 
In spark-defaults.conf:

     spark.executor.memory    14g
     spark.default.parallelism    32
     spark.akka.frameSize        1000

In spark-env.sh:

     SPARK_DRIVER_MEMORY=10G
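
For completeness, the same options could also be set programmatically when 
the SparkContext is created; I actually set them in the config files as 
shown above, but a minimal sketch of the programmatic route, using the 
standard SparkConf API, would be:

     from pyspark import SparkConf, SparkContext

     # Same settings as in spark-defaults.conf, applied in the application instead.
     conf = (SparkConf()
             .set("spark.executor.memory", "14g")
             .set("spark.default.parallelism", "32")
             .set("spark.akka.frameSize", "1000"))
     sc = SparkContext(conf=conf)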

With those settings, however, I get a bunch of WARN statements about 
"Lost TIDs" (no task completes successfully), along with lost executors; 
this repeats 4 times until I finally get the following error message and 
the job crashes:

---

14/07/18 12:06:20 INFO TaskSchedulerImpl: Cancelling stage 0
14/07/18 12:06:20 INFO DAGScheduler: Failed to run collect at /home/user/Programming/PySpark-Affinities/affinity.py:243
Traceback (most recent call last):
  File "/home/user/Programming/PySpark-Affinities/affinity.py", line 243, in <module>
    lambda x: np.abs(IMAGE.value[x[0]] - IMAGE.value[x[1]])
  File "/net/antonin/home/user/Spark/spark-1.0.1-bin-hadoop2/python/pyspark/rdd.py", line 583, in collect
    bytesInJava = self._jrdd.collect().iterator()
  File "/net/antonin/home/user/Spark/spark-1.0.1-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
  File "/net/antonin/home/user/Spark/spark-1.0.1-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o27.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:13 failed 4 times, most recent failure: *TID 32 on host master.host.univ.edu failed for unknown reason*
Driver stacktrace:
     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
     at scala.Option.foreach(Option.scala:236)
     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
     at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
     at akka.actor.ActorCell.invoke(ActorCell.scala:456)
     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
     at akka.dispatch.Mailbox.run(Mailbox.scala:219)
     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

14/07/18 12:06:20 INFO DAGScheduler: Executor lost: 4 (epoch 4)
14/07/18 12:06:20 INFO BlockManagerMasterActor: Trying to remove executor 4 from BlockManagerMaster.
14/07/18 12:06:20 INFO BlockManagerMaster: Removed 4 successfully in removeExecutor
user@master:~/Programming/PySpark-Affinities$

---

If I run the really small image instead (16x16), it *appears* to run to 
completion (it gives me the output I expect without any exceptions being 
thrown). However, the stderr logs for that app list its state as 
"KILLED", with the final message being "ERROR 
CoarseGrainedExecutorBackend: Driver Disassociated". If I run any larger 
images, I get the exception pasted above.

Furthermore, if I do a spark-submit with master=local[*] (aside from 
still needing to set the aforementioned memory options), it works for an 
image of *any* size. I've tested both machines independently, and both 
behave this way when running as local[*]. Running on the cluster, by 
contrast, results in the crash above at stage 0 with anything but the 
smallest images.
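
The only difference between the failing and working runs is the master. If 
I were to create the local-mode context in code rather than via the 
spark-submit master setting, a minimal sketch (with a made-up app name) 
would be:

     from pyspark import SparkConf, SparkContext

     # Local mode: everything runs on one machine, using all available cores.
     conf = SparkConf().setMaster("local[*]").setAppName("affinity-local-test")
     sc = SparkContext(conf=conf)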

Any ideas what is going on?

Thank you very much in advance!

Regards,
Shannon

Re: Job aborted due to stage failure: TID x failed for unknown reasons

Posted by jerryye <je...@gmail.com>.
bump. same problem here.





Re: Job aborted due to stage failure: TID x failed for unknown reasons

Posted by Alessandro Lulli <al...@gmail.com>.
Hi all,

Can someone help with this?

I'm encountering exactly the same issue in a very similar scenario, with 
the same Spark version.

Thanks,
Alessandro

