Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2015/05/15 15:45:59 UTC

[jira] [Resolved] (SPARK-1848) Executors are mysteriously dying when using Spark on Mesos

     [ https://issues.apache.org/jira/browse/SPARK-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-1848.
------------------------------
    Resolution: Cannot Reproduce

I think this is at least stale at this point.

> Executors are mysteriously dying when using Spark on Mesos
> ----------------------------------------------------------
>
>                 Key: SPARK-1848
>                 URL: https://issues.apache.org/jira/browse/SPARK-1848
>             Project: Spark
>          Issue Type: Bug
>          Components: Mesos, Spark Core
>    Affects Versions: 1.0.0
>         Environment: Linux 3.8.0-35-generic #50~precise1-Ubuntu SMP Wed Dec 4 17:25:51 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
> java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mesos 0.18.0
> Spark Master
>            Reporter: Bouke van der Bijl
>
> Here's a logfile: https://gist.github.com/bouk/b4647e7ba62eb169a40a
> We have 47 machines running Mesos that we're trying to run Spark jobs on, but the jobs fail at some
> point because tasks have to be rescheduled too often. The rescheduling happens because Spark kills the
> tasks when their executors die. When I look at the stderr or stdout of the Mesos slaves, there seems to
> be no indication of an error, and sometimes I can see a "14/05/15 17:38:54 INFO DAGScheduler: Ignoring
> possibly bogus ShuffleMapTask completion from <id>", which would indicate that the executor just keeps
> going and hasn't actually died. If I add a Thread.dumpStack() at the location where the job is killed,
> this is the trace it returns (a sketch of this instrumentation follows the quoted report):
>         at java.lang.Thread.dumpStack(Thread.java:1364)
>         at org.apache.spark.scheduler.TaskSetManager.handleFailedTask(TaskSetManager.scala:588)
>         at org.apache.spark.scheduler.TaskSetManager$$anonfun$executorLost$9.apply(TaskSetManager.scala:665)
>         at org.apache.spark.scheduler.TaskSetManager$$anonfun$executorLost$9.apply(TaskSetManager.scala:664)
>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>         at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>         at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>         at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
>         at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>         at org.apache.spark.scheduler.TaskSetManager.executorLost(TaskSetManager.scala:664)
>         at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
>         at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at org.apache.spark.scheduler.Pool.executorLost(Pool.scala:87)
>         at org.apache.spark.scheduler.TaskSchedulerImpl.removeExecutor(TaskSchedulerImpl.scala:412)
>         at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:271)
>         at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:266)
>         at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.statusUpdate(MesosSchedulerBackend.scala:287)
> What could cause this? Is this a setup problem with our cluster or a bug in Spark?
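
For reference, the instrumentation described in the report amounts to a single Thread.dumpStack() call at the point where a task is marked failed. The sketch below is a minimal, self-contained stand-in: the object and method names are hypothetical, not the real Spark classes (per the top frames of the trace above, the real call sits inside org.apache.spark.scheduler.TaskSetManager.handleFailedTask). It only illustrates how the dump surfaces the surrounding call chain on stderr.

    // Hypothetical stand-ins for Spark's scheduler methods; only the
    // Thread.dumpStack() placement mirrors what the report describes.
    object DumpStackDemo {

      // Analogue of TaskSetManager.handleFailedTask: dump the current call
      // stack to stderr before the failure is processed, so the log shows
      // which code path triggered the kill.
      def handleFailedTask(taskId: Long): Unit = {
        Thread.dumpStack()
        System.err.println(s"task $taskId marked as failed")
      }

      // Analogue of TaskSetManager.executorLost: every task that was running
      // on the lost executor is failed, producing one stack dump per task.
      def executorLost(executorId: String, runningTasks: Seq[Long]): Unit =
        runningTasks.foreach(handleFailedTask)

      def main(args: Array[String]): Unit =
        executorLost("executor-1", Seq(1L, 2L, 3L))
    }

Running this prints one stack trace per failed task, analogous to the quoted trace, which shows the kills arriving via the executor-lost path (statusUpdate -> removeExecutor -> executorLost -> handleFailedTask) rather than via an ordinary task failure.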



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org