Posted to issues@spark.apache.org by "Tomas Barton (JIRA)" <ji...@apache.org> on 2014/09/11 00:01:34 UTC

[jira] [Commented] (SPARK-2445) MesosExecutorBackend crashes in fine grained mode

    [ https://issues.apache.org/jira/browse/SPARK-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129206#comment-14129206 ] 

Tomas Barton commented on SPARK-2445:
-------------------------------------

Still the same issue. This error was produced by one of the Spark examples, _SparkLR_:
{code}
14/09/10 23:52:08 INFO BlockManagerInfo: Registering block manager 172.27.11.13:51098 with 294.6 MB RAM
14/09/10 23:52:08 INFO BlockManagerInfo: Registering block manager 172.27.11.11:59588 with 294.6 MB RAM
14/09/10 23:52:09 INFO BlockManagerInfo: Added rdd_0_0 in memory on 172.27.11.11:59588 (size: 919.1 KB, free: 293.7 MB)
14/09/10 23:52:09 INFO BlockManagerInfo: Added rdd_0_1 in memory on 172.27.11.11:59588 (size: 919.1 KB, free: 292.8 MB)
14/09/10 23:52:09 INFO BlockManagerInfo: Added rdd_0_0 in memory on 172.27.11.13:51098 (size: 919.1 KB, free: 293.7 MB)
14/09/10 23:52:10 INFO TaskSetManager: Finished TID 9 in 5233 ms on 172.27.11.11 (progress: 1/2)
14/09/10 23:52:10 INFO DAGScheduler: Completed ResultTask(2, 1)
14/09/10 23:52:10 INFO DAGScheduler: Completed ResultTask(2, 0)
14/09/10 23:52:10 INFO TaskSetManager: Finished TID 10 in 1958 ms on 172.27.11.11 (progress: 2/2)
14/09/10 23:52:10 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 
14/09/10 23:52:10 INFO DAGScheduler: Stage 2 (reduce at SparkLR.scala:64) finished in 6.055 s
14/09/10 23:52:10 INFO SparkContext: Job finished: reduce at SparkLR.scala:64, took 6.079637186 s
On iteration 4
14/09/10 23:52:10 INFO SparkContext: Starting job: reduce at SparkLR.scala:64
14/09/10 23:52:10 INFO DAGScheduler: Got job 3 (reduce at SparkLR.scala:64) with 2 output partitions (allowLocal=false)
14/09/10 23:52:10 INFO DAGScheduler: Final stage: Stage 3(reduce at SparkLR.scala:64)
14/09/10 23:52:10 INFO DAGScheduler: Parents of final stage: List()
14/09/10 23:52:10 INFO DAGScheduler: Missing parents: List()
14/09/10 23:52:10 INFO DAGScheduler: Submitting Stage 3 (MappedRDD[4] at map at SparkLR.scala:62), which has no missing parents
14/09/10 23:52:10 INFO DAGScheduler: Submitting 2 missing tasks from Stage 3 (MappedRDD[4] at map at SparkLR.scala:62)
14/09/10 23:52:10 INFO TaskSchedulerImpl: Adding task set 3.0 with 2 tasks
14/09/10 23:52:10 INFO TaskSetManager: Starting task 3.0:0 as TID 11 on executor 20140910-231511-185277356-5050-425-101: 172.27.11.11 (PROCESS_LOCAL)
14/09/10 23:52:10 INFO TaskSetManager: Serialized task 3.0:0 as 667088 bytes in 26 ms
14/09/10 23:52:10 INFO TaskSetManager: Starting task 3.0:1 as TID 12 on executor 20140910-231511-185277356-5050-425-101: 172.27.11.11 (PROCESS_LOCAL)
14/09/10 23:52:10 INFO TaskSetManager: Serialized task 3.0:1 as 667088 bytes in 24 ms
14/09/10 23:52:10 INFO TaskSetManager: Re-queueing tasks for 20140910-231511-185277356-5050-425-101 from TaskSet 3.0
14/09/10 23:52:10 WARN TaskSetManager: Lost TID 11 (task 3.0:0)
14/09/10 23:52:10 WARN TaskSetManager: Lost TID 12 (task 3.0:1)
14/09/10 23:52:10 INFO DAGScheduler: Executor lost: 20140910-231511-185277356-5050-425-101 (epoch 4)
14/09/10 23:52:10 INFO BlockManagerMasterActor: Trying to remove executor 20140910-231511-185277356-5050-425-101 from BlockManagerMaster.
14/09/10 23:52:10 INFO BlockManagerMaster: Removed 20140910-231511-185277356-5050-425-101 successfully in removeExecutor
14/09/10 23:52:10 INFO DAGScheduler: Host added was in lost list earlier: 172.27.11.11
14/09/10 23:52:10 INFO TaskSetManager: Starting task 3.0:1 as TID 13 on executor 20140910-231511-185277356-5050-425-102: 172.27.11.13 (PROCESS_LOCAL)
14/09/10 23:52:10 INFO TaskSetManager: Serialized task 3.0:1 as 667088 bytes in 9 ms
14/09/10 23:52:14 INFO TaskSetManager: Starting task 3.0:0 as TID 14 on executor 20140910-231511-185277356-5050-425-101: 172.27.11.11 (NODE_LOCAL)
14/09/10 23:52:14 INFO TaskSetManager: Serialized task 3.0:0 as 667088 bytes in 14 ms
14/09/10 23:52:14 ERROR BlockManagerMasterActor: Got two different block manager registrations on 20140910-231511-185277356-5050-425-102
{code}

A workaround is to switch to coarse-grained mode:

{code}
export SPARK_DAEMON_JAVA_OPTS="-Dspark.mesos.coarse=true"
{code}


> MesosExecutorBackend crashes in fine grained mode
> -------------------------------------------------
>
>                 Key: SPARK-2445
>                 URL: https://issues.apache.org/jira/browse/SPARK-2445
>             Project: Spark
>          Issue Type: Bug
>          Components: Mesos
>    Affects Versions: 1.0.0
>            Reporter: Dario Rexin
>
> When multiple instances of the MesosExecutorBackend are running on the same slave, they are all assigned the same executorId (equal to the Mesos slaveId) but each gets a different, randomly assigned port. Because of this, a later instance cannot register a new BlockManager: one is already registered with the same executorId but a different BlockManagerId. More description and a fix can be found in this PR on GitHub:
> https://github.com/apache/spark/pull/1358
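
The collision described in the issue can be illustrated with a small, self-contained sketch. This is a simplified model and not Spark's actual code: the class names `ToyBlockManagerMaster` and `RegistrationConflict` are hypothetical, and the second port (51099) is made up for illustration. The executorId, host, and first port are taken from the log above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockManagerId:
    """Simplified stand-in for a block manager identity: executor id + host + port."""
    executor_id: str
    host: str
    port: int


class RegistrationConflict(Exception):
    """Hypothetical error raised when one executor id maps to two identities."""


class ToyBlockManagerMaster:
    """Toy registry that allows only one BlockManagerId per executor id."""

    def __init__(self):
        self._by_executor = {}

    def register(self, bm_id):
        existing = self._by_executor.get(bm_id.executor_id)
        # Same executor id but a different host/port: treated as a conflict.
        if existing is not None and existing != bm_id:
            raise RegistrationConflict(
                "Got two different block manager registrations on "
                + bm_id.executor_id
            )
        self._by_executor[bm_id.executor_id] = bm_id


def demo():
    # In fine-grained mode the executor id equals the Mesos slave id, so two
    # backends on the same slave share it while listening on different ports.
    slave_id = "20140910-231511-185277356-5050-425-102"
    master = ToyBlockManagerMaster()
    first = BlockManagerId(slave_id, "172.27.11.13", 51098)
    second = BlockManagerId(slave_id, "172.27.11.13", 51099)  # assumed port
    master.register(first)
    try:
        master.register(second)
    except RegistrationConflict as exc:
        return str(exc)
    return "no conflict"


if __name__ == "__main__":
    print(demo())
```

In this model the second registration fails because the registry is keyed by executorId alone. Giving each backend a distinct executorId would avoid the clash in the sketch, but that is not necessarily how the linked PR solves it.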



