Posted to user@spark.apache.org by Marek Wiewiorka <ma...@gmail.com> on 2014/01/28 16:07:56 UTC

Problem with running Spark over Mesos in fine-grained mode

Hi All,
I'm trying to run my app over Mesos in fine-grained mode and I'm getting
the attached errors (from both the Spark driver and the Mesos slave).
The same app runs fine in coarse-grained mode. The only thing I'm changing
is the spark.mesos.coarse param:
System.setProperty("spark.mesos.coarse", "false")
If set to true everything is fine; if set to false (the default), the
errors below occur.
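
For reference, this is roughly how the property is applied (a sketch only;
the master URL, executor URI and app name below are placeholders, not my
actual setup):

import org.apache.spark.SparkContext

object CoarseToggle {
  def main(args: Array[String]): Unit = {
    // Must be set before the SparkContext is created.
    // "false" = fine-grained mode (the default), "true" = coarse-grained.
    System.setProperty("spark.mesos.coarse", "false")
    // Placeholder URI for the Spark distribution the slaves should fetch.
    System.setProperty("spark.executor.uri",
      "hdfs://namenode:9000/path/to/spark-0.9.0-incubating-bin.tar.gz")
    val sc = new SparkContext("mesos://master-host:5050", "CoarseToggle")
    println(sc.parallelize(1 to 100).count())
    sc.stop()
  }
}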

I'm using:
Mesos 0.16-rc3
Spark 0.9.0-rc5

I would appreciate any hints.

Regards,
Marek



Spark driver
=============
14/01/28 14:51:08 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/01/28 14:51:08 INFO Remoting: Starting remoting
14/01/28 14:51:08 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://spark@XXXXXX:49899]
14/01/28 14:51:08 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://spark@XXXXXX:49899]
14/01/28 14:51:08 INFO spark.SparkEnv: Registering BlockManagerMaster
14/01/28 14:51:08 INFO storage.DiskBlockManager: Created local directory at
/tmp/spark-local-20140128145108-e1d9
14/01/28 14:51:08 INFO storage.MemoryStore: MemoryStore started with
capacity 589.2 MB.
14/01/28 14:51:09 INFO network.ConnectionManager: Bound socket to port
34442 with id = ConnectionManagerId(XXXXXX,34442)
14/01/28 14:51:09 INFO storage.BlockManagerMaster: Trying to register
BlockManager
14/01/28 14:51:09 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
Registering block manager XXXXXX:34442 with 589.2 MB RAM
14/01/28 14:51:09 INFO storage.BlockManagerMaster: Registered BlockManager
14/01/28 14:51:09 INFO spark.HttpServer: Starting HTTP Server
14/01/28 14:51:09 INFO server.Server: jetty-7.6.8.v20121106
14/01/28 14:51:09 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:41485
14/01/28 14:51:09 INFO broadcast.HttpBroadcast: Broadcast server started at
http://10.0.0.8:41485
14/01/28 14:51:09 INFO spark.SparkEnv: Registering MapOutputTracker
14/01/28 14:51:09 INFO spark.HttpFileServer: HTTP File server directory is
/tmp/spark-fcd06bbb-225d-4de4-b5c7-75b054c6a67d
14/01/28 14:51:09 INFO spark.HttpServer: Starting HTTP Server
14/01/28 14:51:09 INFO server.Server: jetty-7.6.8.v20121106
14/01/28 14:51:09 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:47676
14/01/28 14:51:09 INFO server.Server: jetty-7.6.8.v20121106
14/01/28 14:51:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/storage/rdd,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/storage,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/stages/stage,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/stages/pool,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/stages,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/environment,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/executors,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/metrics/json,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/static,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/,null}
14/01/28 14:51:09 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
14/01/28 14:51:09 INFO ui.SparkUI: Started Spark Web UI at
http://XXXXXX:4040
14/01/28 14:51:11 INFO storage.MemoryStore: ensureFreeSpace(92955) called
with curMem=0, maxMem=617820979
14/01/28 14:51:11 INFO storage.MemoryStore: Block broadcast_0 stored as
values to memory (estimated size 90.8 KB, free 589.1 MB)
14/01/28 14:51:11 INFO input.FileInputFormat: Total input paths to process
: 1
14/01/28 14:51:12 INFO spark.SparkContext: Starting job: count at
SparkSeqSimpleJob.scala:48
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Registering RDD 4
(reduceByKey at SparkSeqAnalysis.scala:355)
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Got job 0 (count at
SparkSeqSimpleJob.scala:48) with 25 output partitions (allowLocal=false)
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Final stage: Stage 0 (count
at SparkSeqSimpleJob.scala:48)
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Parents of final stage:
List(Stage 1)
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Missing parents: List(Stage
1)
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Submitting Stage 1
(MapPartitionsRDD[4] at reduceByKey at SparkSeqAnalysis.scala:355), which
has no missing parents
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Submitting 24 missing tasks
from Stage 1 (MapPartitionsRDD[4] at reduceByKey at
SparkSeqAnalysis.scala:355)
14/01/28 14:51:12 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0
with 24 tasks
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:0 as TID
0 on executor 201401281448-134217738-5050-17953-1: XXXXXX (PROCESS_LOCAL)
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:0 as
2500 bytes in 13 ms
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:1 as TID
1 on executor 201401281448-134217738-5050-17953-1: XXXXXX (PROCESS_LOCAL)
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:1 as
2500 bytes in 0 ms
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:2 as TID
2 on executor 201401281448-134217738-5050-17953-1: XXXXXX (PROCESS_LOCAL)
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:2 as
2500 bytes in 1 ms
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:3 as TID
3 on executor 201401281448-134217738-5050-17953-1: XXXXXX (PROCESS_LOCAL)
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:3 as
2500 bytes in 1 ms
14/01/28 14:51:20 INFO scheduler.TaskSetManager: Re-queueing tasks for
201401281448-134217738-5050-17953-1 from TaskSet 1.0
14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 2 (task 1.0:2)
14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 1 (task 1.0:1)
14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 3 (task 1.0:3)
14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 0 (task 1.0:0)
14/01/28 14:51:20 INFO scheduler.DAGScheduler: Executor lost:
201401281448-134217738-5050-17953-1 (epoch 0)
14/01/28 14:51:20 INFO storage.BlockManagerMasterActor: Trying to remove
executor 201401281448-134217738-5050-17953-1 from BlockManagerMaster.
14/01/28 14:51:20 INFO storage.BlockManagerMaster: Removed
201401281448-134217738-5050-17953-1 successfully in removeExecutor
1

Mesos SLAVE LOG
============
I0128 14:51:12.996467 17996 slave.cpp:727] Got assigned task 0 for
framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.996961 17996 slave.cpp:727] Got assigned task 1 for
framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.997133 17996 slave.cpp:727] Got assigned task 2 for
framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.997287 17996 slave.cpp:727] Got assigned task 3 for
framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.997429 17996 slave.cpp:836] Launching task 0 for framework
201401281448-134217738-5050-17953-0000
I0128 14:51:12.998803 17996 slave.cpp:946] Queuing task '0' for executor
201401281448-134217738-5050-17953-1 of framework
'201401281448-134217738-5050-17953-0000
I0128 14:51:12.998819 17993 process_isolator.cpp:100] Launching
201401281448-134217738-5050-17953-1 (cd spark-0*; ./sbin/spark-executor) in
/tmp/mesos/slaves/201401281448-134217738-5050-17953-1/frameworks/201401281448-134217738-5050-17953-0000/executors/201401281448-134217738-5050-17953-1/runs/a4c7bbb6-2ba2-4b03-9cd7-5bc22e12c1c9
with resources mem(*):10240' for framework
201401281448-134217738-5050-17953-0000
I0128 14:51:12.998883 17996 slave.cpp:836] Launching task 1 for framework
201401281448-134217738-5050-17953-0000
I0128 14:51:12.998914 17996 slave.cpp:946] Queuing task '1' for executor
201401281448-134217738-5050-17953-1 of framework
'201401281448-134217738-5050-17953-0000
I0128 14:51:12.998947 17996 slave.cpp:836] Launching task 2 for framework
201401281448-134217738-5050-17953-0000
I0128 14:51:12.998971 17996 slave.cpp:946] Queuing task '2' for executor
201401281448-134217738-5050-17953-1 of framework
'201401281448-134217738-5050-17953-0000
I0128 14:51:12.999003 17996 slave.cpp:836] Launching task 3 for framework
201401281448-134217738-5050-17953-0000
I0128 14:51:12.999943 17996 slave.cpp:946] Queuing task '3' for executor
201401281448-134217738-5050-17953-1 of framework
'201401281448-134217738-5050-17953-0000
I0128 14:51:13.000185 17993 process_isolator.cpp:163] Forked executor at
18386
I0128 14:51:13.001010 17993 slave.cpp:2089] Monitoring executor
201401281448-134217738-5050-17953-1 of framework
201401281448-134217738-5050-17953-0000 forked at pid 18386
I0128 14:51:17.422936 17991 slave.cpp:1422] Got registration for executor
'201401281448-134217738-5050-17953-1' of framework
201401281448-134217738-5050-17953-0000
I0128 14:51:17.423828 17991 slave.cpp:1543] Flushing queued task 0 for
executor '201401281448-134217738-5050-17953-1' of framework
201401281448-134217738-5050-17953-0000
I0128 14:51:17.424185 17991 slave.cpp:1543] Flushing queued task 1 for
executor '201401281448-134217738-5050-17953-1' of framework
201401281448-134217738-5050-17953-0000
I0128 14:51:17.424612 17991 slave.cpp:1543] Flushing queued task 2 for
executor '201401281448-134217738-5050-17953-1' of framework
201401281448-134217738-5050-17953-0000
I0128 14:51:17.425472 17991 slave.cpp:1543] Flushing queued task 3 for
executor '201401281448-134217738-5050-17953-1' of framework
201401281448-134217738-5050-17953-0000
I0128 14:51:20.427978 17991 process_isolator.cpp:482] Telling slave of
terminated executor '201401281448-134217738-5050-17953-1' of framework
201401281448-134217738-5050-17953-0000
I0128 14:51:20.428513 17992 slave.cpp:2145] Executor
'201401281448-134217738-5050-17953-1' of framework
201401281448-134217738-5050-17953-0000 has exited with status 0
I0128 14:51:20.433177 17992 slave.cpp:1756] Handling status update
TASK_LOST (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for task 0 of
framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
I0128 14:51:20.433763 17993 status_update_manager.cpp:312] Received status
update TASK_LOST (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for task 0 of
framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.434661 17992 slave.cpp:1756] Handling status update
TASK_LOST (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for task 1 of
framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
I0128 14:51:20.435967 17993 status_update_manager.cpp:365] Forwarding
status update TASK_LOST (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for
task 0 of framework 201401281448-134217738-5050-17953-0000 to
master@10.0.0.8:5050
I0128 14:51:20.436715 17993 status_update_manager.cpp:312] Received status
update TASK_LOST (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for task 1 of
framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.439827 17993 status_update_manager.cpp:365] Forwarding
status update TASK_LOST (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for
task 1 of framework 201401281448-134217738-5050-17953-0000 to
master@10.0.0.8:5050
W0128 14:51:20.440202 17991 process_isolator.cpp:268] Failed to kill the
process tree rooted at pid 18386: Failed to find process 18386
I0128 14:51:20.440649 17991 process_isolator.cpp:301] Asked to update
resources for an unknown/killed executor
'201401281448-134217738-5050-17953-1' of framework
201401281448-134217738-5050-17953-0000
I0128 14:51:20.451300 17991 process_isolator.cpp:301] Asked to update
resources for an unknown/killed executor
'201401281448-134217738-5050-17953-1' of framework
201401281448-134217738-5050-17953-0000
I0128 14:51:20.436820 17992 slave.cpp:1756] Handling status update
TASK_LOST (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for task 2 of
framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
I0128 14:51:20.452231 17995 process_isolator.cpp:301] Asked to update
resources for an unknown/killed executor
'201401281448-134217738-5050-17953-1' of framework
201401281448-134217738-5050-17953-0000
I0128 14:51:20.452239 17994 status_update_manager.cpp:312] Received status
update TASK_LOST (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for task 2 of
framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.452981 17994 status_update_manager.cpp:365] Forwarding
status update TASK_LOST (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for
task 2 of framework 201401281448-134217738-5050-17953-0000 to
master@10.0.0.8:5050
I0128 14:51:20.453176 17992 slave.cpp:1756] Handling status update
TASK_LOST (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for task 3 of
framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
I0128 14:51:20.453801 17998 process_isolator.cpp:301] Asked to update
resources for an unknown/killed executor
'201401281448-134217738-5050-17953-1' of framework
201401281448-134217738-5050-17953-0000
I0128 14:51:20.453943 17992 status_update_manager.cpp:312] Received status
update TASK_LOST (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for task 3 of
framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.455003 17992 status_update_manager.cpp:365] Forwarding
status update TASK_LOST (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for
task 3 of framework 201401281448-134217738-5050-17953-0000 to
master@10.0.0.8:5050
I0128 14:51:20.489073 17997 status_update_manager.cpp:390] Received status
update acknowledgement (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for
task 0 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.489478 17997 status_update_manager.cpp:390] Received status
update acknowledgement (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for
task 1 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.490196 17997 status_update_manager.cpp:390] Received status
update acknowledgement (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for
task 2 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.490628 17997 status_update_manager.cpp:390] Received status
update acknowledgement (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for
task 3 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.490890 17997 slave.cpp:2280] Cleaning up executor
'201401281448-134217738-5050-17953-1' of framework
201401281448-134217738-5050-17953-0000
I0128 14:51:20.491407 17997 slave.cpp:2351] Cleaning up framework
201401281448-134217738-5050-17953-0000

Re: Problem with running Spark over Mesos in fine-grained mode

Posted by Alberto Miorin <sp...@ululi.it>.
Hi all!

I have the same problem with the same exception. I've opened a ticket:
https://spark-project.atlassian.net/browse/SPARK-1052

I didn't try coarse-grained mode.

Alberto


On Wed, Jan 29, 2014 at 12:56 PM, Marek Wiewiorka <marek.wiewiorka@gmail.com> wrote:

> [...]

Re: Problem with running Spark over Mesos in fine-grained mode

Posted by Marek Wiewiorka <ma...@gmail.com>.
Hi Benjamin,
thanks for the response!
These are the logs you requested:
stdout:
========
Fetching resources into
'/tmp/mesos/slaves/201401291036-134217738-5050-27878-1/frameworks/201401291036-134217738-5050-27878-0003/executors/201401291036-134217738-5050-27878-1/runs/9dc28fa1-78ac-44de-b73d-48bf624aadc2'
Fetching resource
'hdfs://XXXXX:9000/frameworks/spark/0.9.0/spark-0.9.0-incubating-hadoop_1.2.1-bin.tar.gz'
Extracting resource: tar xzf
'./spark-0.9.0-incubating-hadoop_1.2.1-bin.tar.gz'
Running spark-executor with framework dir =
/tmp/mesos/slaves/201401291036-134217738-5050-27878-1/frameworks/201401291036-134217738-5050-27878-0003/executors/201401291036-134217738-5050-27878-1/runs/9dc28fa1-78ac-44de-b73d-48bf624aadc2/spark-0.9.0-incubating
2014-01-29 11:41:46,626 [Thread-0] INFO org.apache.spark.executor.MesosExecutorBackend - Registered with Mesos as executor ID 201401291036-134217738-5050-27878-1
2014-01-29 11:41:47,898 [spark-akka.actor.default-dispatcher-4] INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2014-01-29 11:41:47,980 [spark-akka.actor.default-dispatcher-4] INFO Remoting - Starting remoting
2014-01-29 11:41:48,303 [spark-akka.actor.default-dispatcher-4] INFO Remoting - Remoting started; listening on addresses :[akka.tcp://spark@XXXXX:51350]
2014-01-29 11:41:48,306 [spark-akka.actor.default-dispatcher-2] INFO Remoting - Remoting now listens on addresses: [akka.tcp://spark@XXXXX:51350]
2014-01-29 11:41:48,335 [Thread-0] INFO org.apache.spark.SparkEnv - Connecting to BlockManagerMaster: akka.tcp://spark@XXXXX:7077/user/BlockManagerMaster

stderr
=====
akka.actor.ActorNotFound: Actor not found for:
ActorSelection[Actor[akka.tcp://spark@XXXXX:7077/]/user/BlockManagerMaster]
        at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:66)
        at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:64)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
        at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
        at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
        at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
        at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
        at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
        at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
        at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
        at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
        at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
        at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
        at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
        at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:269)
        at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:512)
        at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:545)
        at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:535)
        at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:91)
        at akka.actor.ActorRef.tell(ActorRef.scala:125)
        at akka.dispatch.Mailboxes$$anon$1$$anon$2.enqueue(Mailboxes.scala:44)
        at akka.dispatch.QueueBasedMessageQueue$class.cleanUp(Mailbox.scala:438)
        at akka.dispatch.UnboundedDequeBasedMailbox$MessageQueue.cleanUp(Mailbox.scala:650)
        at akka.dispatch.Mailbox.cleanUp(Mailbox.scala:309)
        at akka.dispatch.MessageDispatcher.unregister(AbstractDispatcher.scala:204)
        at akka.dispatch.MessageDispatcher.detach(AbstractDispatcher.scala:140)
        at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:203)
        at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:163)
        at akka.actor.ActorCell.terminate(ActorCell.scala:338)
        at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:431)
        at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447)
        at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262)
        at akka.dispatch.Mailbox.run(Mailbox.scala:218)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Exception in thread "Thread-0"

Marek



2014-01-29 Benjamin Hindman <be...@gmail.com>

> [...]

Re: Problem with running Spark over Mesos in fine-grained mode

Posted by Benjamin Hindman <be...@gmail.com>.
Hey Marek,

Can you include the stdout and stderr of the executors that are running?
You should be able to get them from the Mesos master web UI or directly on
the slave. If they are empty, you'll need to set up the log4j properties.

Ben.
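
For anyone digging into this later: the executor's stdout and stderr land in
its Mesos sandbox, i.e. the run directory shown in the slave log below
(/tmp/mesos/slaves/<slave-id>/frameworks/<framework-id>/executors/<executor-id>/runs/<run-id>).
If they come up empty, a minimal log4j.properties along the lines of the
conf/log4j.properties.template shipped with Spark will route executor logging
to stderr. A sketch (placement in the executors' Spark conf/ directory is an
assumption; adjust log levels to taste):

# Minimal log4j.properties sketch; assumes Spark's stock log4j wiring.
# Everything goes to a console appender on stderr, so it ends up in the
# executor's stderr file inside the Mesos sandbox.
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n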


On Tue, Jan 28, 2014 at 7:07 AM, Marek Wiewiorka
<ma...@gmail.com>wrote:

> Hi All,
> I'm trying to run my app over Mesos in fine-grained mode and I'm getting
> the attached errors (both Spark driver and Mesos slave).
> The same app in coarse-grained mode runs fine. The only thing I'm changing
> is setting the spark.mesos.coarse param to true/false:
> System.setProperty("spark.mesos.coarse", "false")
> If set to true everything is fine; if set to false (the default), the
> errors below occur.
>
> I'm using:
> Mesos 0.16-rc3
> Spark 0.9.0-rc5
>
> I would appreciate any hints.
>
> Regards,
> Marek
>
>
>
> Spark driver
> =============
> 14/01/28 14:51:08 INFO slf4j.Slf4jLogger: Slf4jLogger started
> 14/01/28 14:51:08 INFO Remoting: Starting remoting
> 14/01/28 14:51:08 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://spark@XXXXXX:49899]
> 14/01/28 14:51:08 INFO Remoting: Remoting now listens on addresses:
> [akka.tcp://spark@XXXXXX:49899]
> 14/01/28 14:51:08 INFO spark.SparkEnv: Registering BlockManagerMaster
> 14/01/28 14:51:08 INFO storage.DiskBlockManager: Created local directory
> at /tmp/spark-local-20140128145108-e1d9
> 14/01/28 14:51:08 INFO storage.MemoryStore: MemoryStore started with
> capacity 589.2 MB.
> 14/01/28 14:51:09 INFO network.ConnectionManager: Bound socket to port
> 34442 with id = ConnectionManagerId(XXXXXX,34442)
> 14/01/28 14:51:09 INFO storage.BlockManagerMaster: Trying to register
> BlockManager
> 14/01/28 14:51:09 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
> Registering block manager XXXXXX:34442 with 589.2 MB RAM
> 14/01/28 14:51:09 INFO storage.BlockManagerMaster: Registered BlockManager
> 14/01/28 14:51:09 INFO spark.HttpServer: Starting HTTP Server
> 14/01/28 14:51:09 INFO server.Server: jetty-7.6.8.v20121106
> 14/01/28 14:51:09 INFO server.AbstractConnector: Started
> SocketConnector@0.0.0.0:41485
> 14/01/28 14:51:09 INFO broadcast.HttpBroadcast: Broadcast server started
> at http://10.0.0.8:41485
> 14/01/28 14:51:09 INFO spark.SparkEnv: Registering MapOutputTracker
> 14/01/28 14:51:09 INFO spark.HttpFileServer: HTTP File server directory is
> /tmp/spark-fcd06bbb-225d-4de4-b5c7-75b054c6a67d
> 14/01/28 14:51:09 INFO spark.HttpServer: Starting HTTP Server
> 14/01/28 14:51:09 INFO server.Server: jetty-7.6.8.v20121106
> 14/01/28 14:51:09 INFO server.AbstractConnector: Started
> SocketConnector@0.0.0.0:47676
> 14/01/28 14:51:09 INFO server.Server: jetty-7.6.8.v20121106
> 14/01/28 14:51:09 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/storage/rdd,null}
> 14/01/28 14:51:09 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/storage,null}
> 14/01/28 14:51:09 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/stages/stage,null}
> 14/01/28 14:51:09 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/stages/pool,null}
> 14/01/28 14:51:09 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/stages,null}
> 14/01/28 14:51:09 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/environment,null}
> 14/01/28 14:51:09 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/executors,null}
> 14/01/28 14:51:09 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/metrics/json,null}
> 14/01/28 14:51:09 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/static,null}
> 14/01/28 14:51:09 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/,null}
> 14/01/28 14:51:09 INFO server.AbstractConnector: Started
> SelectChannelConnector@0.0.0.0:4040
> 14/01/28 14:51:09 INFO ui.SparkUI: Started Spark Web UI at
> http://XXXXXX:4040
> 14/01/28 14:51:11 INFO storage.MemoryStore: ensureFreeSpace(92955) called
> with curMem=0, maxMem=617820979
> 14/01/28 14:51:11 INFO storage.MemoryStore: Block broadcast_0 stored as
> values to memory (estimated size 90.8 KB, free 589.1 MB)
> 14/01/28 14:51:11 INFO input.FileInputFormat: Total input paths to process
> : 1
> 14/01/28 14:51:12 INFO spark.SparkContext: Starting job: count at
> SparkSeqSimpleJob.scala:48
> 14/01/28 14:51:12 INFO scheduler.DAGScheduler: Registering RDD 4
> (reduceByKey at SparkSeqAnalysis.scala:355)
> 14/01/28 14:51:12 INFO scheduler.DAGScheduler: Got job 0 (count at
> SparkSeqSimpleJob.scala:48) with 25 output partitions (allowLocal=false)
> 14/01/28 14:51:12 INFO scheduler.DAGScheduler: Final stage: Stage 0 (count
> at SparkSeqSimpleJob.scala:48)
> 14/01/28 14:51:12 INFO scheduler.DAGScheduler: Parents of final stage:
> List(Stage 1)
> 14/01/28 14:51:12 INFO scheduler.DAGScheduler: Missing parents: List(Stage
> 1)
> 14/01/28 14:51:12 INFO scheduler.DAGScheduler: Submitting Stage 1
> (MapPartitionsRDD[4] at reduceByKey at SparkSeqAnalysis.scala:355), which
> has no missing parents
> 14/01/28 14:51:12 INFO scheduler.DAGScheduler: Submitting 24 missing tasks
> from Stage 1 (MapPartitionsRDD[4] at reduceByKey at
> SparkSeqAnalysis.scala:355)
> 14/01/28 14:51:12 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0
> with 24 tasks
> 14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:0 as
> TID 0 on executor 201401281448-134217738-5050-17953-1: XXXXXX
> (PROCESS_LOCAL)
> 14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:0 as
> 2500 bytes in 13 ms
> 14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:1 as
> TID 1 on executor 201401281448-134217738-5050-17953-1: XXXXXX
> (PROCESS_LOCAL)
> 14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:1 as
> 2500 bytes in 0 ms
> 14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:2 as
> TID 2 on executor 201401281448-134217738-5050-17953-1: XXXXXX
> (PROCESS_LOCAL)
> 14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:2 as
> 2500 bytes in 1 ms
> 14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:3 as
> TID 3 on executor 201401281448-134217738-5050-17953-1: XXXXXX
> (PROCESS_LOCAL)
> 14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:3 as
> 2500 bytes in 1 ms
> 14/01/28 14:51:20 INFO scheduler.TaskSetManager: Re-queueing tasks for
> 201401281448-134217738-5050-17953-1 from TaskSet 1.0
> 14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 2 (task 1.0:2)
> 14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 1 (task 1.0:1)
> 14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 3 (task 1.0:3)
> 14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 0 (task 1.0:0)
> 14/01/28 14:51:20 INFO scheduler.DAGScheduler: Executor lost:
> 201401281448-134217738-5050-17953-1 (epoch 0)
> 14/01/28 14:51:20 INFO storage.BlockManagerMasterActor: Trying to remove
> executor 201401281448-134217738-5050-17953-1 from BlockManagerMaster.
> 14/01/28 14:51:20 INFO storage.BlockManagerMaster: Removed
> 201401281448-134217738-5050-17953-1 successfully in removeExecutor
>
> Mesos SLAVE LOG
> ============
> I0128 14:51:12.996467 17996 slave.cpp:727] Got assigned task 0 for
> framework 201401281448-134217738-5050-17953-0000
> I0128 14:51:12.996961 17996 slave.cpp:727] Got assigned task 1 for
> framework 201401281448-134217738-5050-17953-0000
> I0128 14:51:12.997133 17996 slave.cpp:727] Got assigned task 2 for
> framework 201401281448-134217738-5050-17953-0000
> I0128 14:51:12.997287 17996 slave.cpp:727] Got assigned task 3 for
> framework 201401281448-134217738-5050-17953-0000
> I0128 14:51:12.997429 17996 slave.cpp:836] Launching task 0 for framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:12.998803 17996 slave.cpp:946] Queuing task '0' for executor
> 201401281448-134217738-5050-17953-1 of framework
> '201401281448-134217738-5050-17953-0000
> I0128 14:51:12.998819 17993 process_isolator.cpp:100] Launching
> 201401281448-134217738-5050-17953-1 (cd spark-0*; ./sbin/spark-executor) in
> /tmp/mesos/slaves/201401281448-134217738-5050-17953-1/frameworks/201401281448-134217738-5050-17953-0000/executors/201401281448-134217738-5050-17953-1/runs/a4c7bbb6-2ba2-4b03-9cd7-5bc22e12c1c9
> with resources mem(*):10240' for framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:12.998883 17996 slave.cpp:836] Launching task 1 for framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:12.998914 17996 slave.cpp:946] Queuing task '1' for executor
> 201401281448-134217738-5050-17953-1 of framework
> '201401281448-134217738-5050-17953-0000
> I0128 14:51:12.998947 17996 slave.cpp:836] Launching task 2 for framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:12.998971 17996 slave.cpp:946] Queuing task '2' for executor
> 201401281448-134217738-5050-17953-1 of framework
> '201401281448-134217738-5050-17953-0000
> I0128 14:51:12.999003 17996 slave.cpp:836] Launching task 3 for framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:12.999943 17996 slave.cpp:946] Queuing task '3' for executor
> 201401281448-134217738-5050-17953-1 of framework
> '201401281448-134217738-5050-17953-0000
> I0128 14:51:13.000185 17993 process_isolator.cpp:163] Forked executor at
> 18386
> I0128 14:51:13.001010 17993 slave.cpp:2089] Monitoring executor
> 201401281448-134217738-5050-17953-1 of framework
> 201401281448-134217738-5050-17953-0000 forked at pid 18386
> I0128 14:51:17.422936 17991 slave.cpp:1422] Got registration for executor
> '201401281448-134217738-5050-17953-1' of framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:17.423828 17991 slave.cpp:1543] Flushing queued task 0 for
> executor '201401281448-134217738-5050-17953-1' of framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:17.424185 17991 slave.cpp:1543] Flushing queued task 1 for
> executor '201401281448-134217738-5050-17953-1' of framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:17.424612 17991 slave.cpp:1543] Flushing queued task 2 for
> executor '201401281448-134217738-5050-17953-1' of framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:17.425472 17991 slave.cpp:1543] Flushing queued task 3 for
> executor '201401281448-134217738-5050-17953-1' of framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.427978 17991 process_isolator.cpp:482] Telling slave of
> terminated executor '201401281448-134217738-5050-17953-1' of framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.428513 17992 slave.cpp:2145] Executor
> '201401281448-134217738-5050-17953-1' of framework
> 201401281448-134217738-5050-17953-0000 has exited with status 0
> I0128 14:51:20.433177 17992 slave.cpp:1756] Handling status update
> TASK_LOST (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for task 0 of
> framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
> I0128 14:51:20.433763 17993 status_update_manager.cpp:312] Received status
> update TASK_LOST (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for task 0 of
> framework 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.434661 17992 slave.cpp:1756] Handling status update
> TASK_LOST (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for task 1 of
> framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
> I0128 14:51:20.435967 17993 status_update_manager.cpp:365] Forwarding
> status update TASK_LOST (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for
> task 0 of framework 201401281448-134217738-5050-17953-0000 to
> master@10.0.0.8:5050
> I0128 14:51:20.436715 17993 status_update_manager.cpp:312] Received status
> update TASK_LOST (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for task 1 of
> framework 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.439827 17993 status_update_manager.cpp:365] Forwarding
> status update TASK_LOST (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for
> task 1 of framework 201401281448-134217738-5050-17953-0000 to
> master@10.0.0.8:5050
> W0128 14:51:20.440202 17991 process_isolator.cpp:268] Failed to kill the
> process tree rooted at pid 18386: Failed to find process 18386
> I0128 14:51:20.440649 17991 process_isolator.cpp:301] Asked to update
> resources for an unknown/killed executor
> '201401281448-134217738-5050-17953-1' of framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.451300 17991 process_isolator.cpp:301] Asked to update
> resources for an unknown/killed executor
> '201401281448-134217738-5050-17953-1' of framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.436820 17992 slave.cpp:1756] Handling status update
> TASK_LOST (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for task 2 of
> framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
> I0128 14:51:20.452231 17995 process_isolator.cpp:301] Asked to update
> resources for an unknown/killed executor
> '201401281448-134217738-5050-17953-1' of framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.452239 17994 status_update_manager.cpp:312] Received status
> update TASK_LOST (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for task 2 of
> framework 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.452981 17994 status_update_manager.cpp:365] Forwarding
> status update TASK_LOST (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for
> task 2 of framework 201401281448-134217738-5050-17953-0000 to
> master@10.0.0.8:5050
> I0128 14:51:20.453176 17992 slave.cpp:1756] Handling status update
> TASK_LOST (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for task 3 of
> framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
> I0128 14:51:20.453801 17998 process_isolator.cpp:301] Asked to update
> resources for an unknown/killed executor
> '201401281448-134217738-5050-17953-1' of framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.453943 17992 status_update_manager.cpp:312] Received status
> update TASK_LOST (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for task 3 of
> framework 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.455003 17992 status_update_manager.cpp:365] Forwarding
> status update TASK_LOST (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for
> task 3 of framework 201401281448-134217738-5050-17953-0000 to
> master@10.0.0.8:5050
> I0128 14:51:20.489073 17997 status_update_manager.cpp:390] Received status
> update acknowledgement (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for
> task 0 of framework 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.489478 17997 status_update_manager.cpp:390] Received status
> update acknowledgement (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for
> task 1 of framework 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.490196 17997 status_update_manager.cpp:390] Received status
> update acknowledgement (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for
> task 2 of framework 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.490628 17997 status_update_manager.cpp:390] Received status
> update acknowledgement (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for
> task 3 of framework 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.490890 17997 slave.cpp:2280] Cleaning up executor
> '201401281448-134217738-5050-17953-1' of framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.491407 17997 slave.cpp:2351] Cleaning up framework
> 201401281448-134217738-5050-17953-0000
>
>
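
For completeness, a minimal sketch of the driver-side toggle between the two
Mesos scheduling modes. The master URL is the one from the logs above; the
object name, app name, and job body are placeholders:

import org.apache.spark.SparkContext

object CoarseVsFineSketch {
  def main(args: Array[String]): Unit = {
    // Must be set before the SparkContext is constructed.
    // "true"  = coarse-grained: Spark holds one long-running Mesos task per
    //           slave and schedules its own tasks inside it.
    // "false" = fine-grained (the default): every Spark task is launched as
    //           its own short-lived Mesos task.
    System.setProperty("spark.mesos.coarse", "true")

    val sc = new SparkContext("mesos://10.0.0.8:5050", "MyApp")
    // ... job code, e.g. sc.textFile(...).count() ...
    sc.stop()
  }
}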