Posted to dev@mesos.apache.org by qingyang li <li...@gmail.com> on 2014/07/01 10:24:33 UTC

Tasks always lost

I am using Mesos 0.19 and Spark 0.9.0. The Mesos cluster is started, but
when I use spark-shell to submit a job, the tasks are always lost. Here is
the log:
----------
14/07/01 16:24:27 INFO DAGScheduler: Host gained which was in lost list
earlier: bigdata005
14/07/01 16:24:27 INFO TaskSetManager: Starting task 0.0:1 as TID 4042 on
executor 20140616-143932-1694607552-5050-4080-2: bigdata005 (PROCESS_LOCAL)
14/07/01 16:24:27 INFO TaskSetManager: Serialized task 0.0:1 as 1570 bytes
in 0 ms
14/07/01 16:24:28 INFO TaskSetManager: Re-queueing tasks for
20140616-104524-1694607552-5050-26919-1 from TaskSet 0.0
14/07/01 16:24:28 WARN TaskSetManager: Lost TID 4041 (task 0.0:0)
14/07/01 16:24:28 INFO DAGScheduler: Executor lost:
20140616-104524-1694607552-5050-26919-1 (epoch 3427)
14/07/01 16:24:28 INFO BlockManagerMasterActor: Trying to remove executor
20140616-104524-1694607552-5050-26919-1 from BlockManagerMaster.
14/07/01 16:24:28 INFO BlockManagerMaster: Removed
20140616-104524-1694607552-5050-26919-1 successfully in removeExecutor
14/07/01 16:24:28 INFO TaskSetManager: Re-queueing tasks for
20140616-143932-1694607552-5050-4080-2 from TaskSet 0.0
14/07/01 16:24:28 WARN TaskSetManager: Lost TID 4042 (task 0.0:1)
14/07/01 16:24:28 INFO DAGScheduler: Executor lost:
20140616-143932-1694607552-5050-4080-2 (epoch 3428)
14/07/01 16:24:28 INFO BlockManagerMasterActor: Trying to remove executor
20140616-143932-1694607552-5050-4080-2 from BlockManagerMaster.
14/07/01 16:24:28 INFO BlockManagerMaster: Removed
20140616-143932-1694607552-5050-4080-2 successfully in removeExecutor
14/07/01 16:24:28 INFO DAGScheduler: Host gained which was in lost list
earlier: bigdata005
14/07/01 16:24:28 INFO DAGScheduler: Host gained which was in lost list
earlier: bigdata001
14/07/01 16:24:28 INFO TaskSetManager: Starting task 0.0:1 as TID 4043 on
executor 20140616-143932-1694607552-5050-4080-2: bigdata005 (PROCESS_LOCAL)
14/07/01 16:24:28 INFO TaskSetManager: Serialized task 0.0:1 as 1570 bytes
in 0 ms
14/07/01 16:24:28 INFO TaskSetManager: Starting task 0.0:0 as TID 4044 on
executor 20140616-104524-1694607552-5050-26919-1: bigdata001 (PROCESS_LOCAL)
14/07/01 16:24:28 INFO TaskSetManager: Serialized task 0.0:0 as 1570 bytes
in 0 ms


It seems someone else has also encountered this problem:
http://mail-archives.apache.org/mod_mbox/incubator-mesos-dev/201305.mbox/%3C201305161047069952830@nfs.iscas.ac.cn%3E

Re: Tasks always lost

Posted by qingyang li <li...@gmail.com>.

I have set the following in spark-env.sh:

export SPARK_EXECUTOR_URI=hdfs://192.168.1.101:8020/user/spark/spark-0.9.0-incubating-bin-hadoop2.tgz

The slave can access spark-0.9.0-incubating-bin-hadoop2.tgz.


2014-07-03 1:24 GMT+08:00 Vinod Kone <vi...@gmail.com>:

> On Tue, Jul 1, 2014 at 9:12 PM, qingyang li <li...@gmail.com>
> wrote:
>
> > '20140702-113428-1694607552-5050-17766-0000' failed to start: Failed to
> > fetch URIs for container 'af557235-2d5f-4062-aaf3-a747cb3cd0d1': exit
> > status 32512
> >
>
> Looks like the Mesos slave is unable to fetch the executor. Where is the
> Spark executor stored, i.e., what is the URI? Is it accessible from the
> slave host?
>

Re: Tasks always lost

Posted by Vinod Kone <vi...@gmail.com>.

On Tue, Jul 1, 2014 at 9:12 PM, qingyang li <li...@gmail.com>
wrote:

> '20140702-113428-1694607552-5050-17766-0000' failed to start: Failed to
> fetch URIs for container 'af557235-2d5f-4062-aaf3-a747cb3cd0d1': exit
> status 32512
>

Looks like the Mesos slave is unable to fetch the executor. Where is the
Spark executor stored, i.e., what is the URI? Is it accessible from the
slave host?
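
One way to approach the accessibility question, as a rough sketch only: for an
hdfs:// executor URI, the slave host most likely needs a working `hadoop`
client on the PATH of the slave process. That dependency is an assumption
here, not something confirmed in this thread, and `fetch_tool_for` and
`check_fetchable` are hypothetical helper names, not part of Mesos or Spark.
The script is meant to be run on the slave host, in the same environment the
slave runs in.

```python
import shutil
from typing import Callable, Optional


def fetch_tool_for(uri: str) -> Optional[str]:
    """Return the external command assumed to be needed to fetch `uri`."""
    if uri.startswith("hdfs://"):
        # Assumption: hdfs:// URIs are fetched through the hadoop client.
        return "hadoop"
    return None


def check_fetchable(
    uri: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Report whether the command assumed for `uri` is on the PATH."""
    tool = fetch_tool_for(uri)
    if tool is None:
        return "no external tool assumed for this URI scheme"
    path = which(tool)
    if path is None:
        return f"'{tool}' not found on PATH"
    return f"'{tool}' found at {path}"


if __name__ == "__main__":
    # Placeholder URI; substitute the value of SPARK_EXECUTOR_URI.
    print(check_fetchable("hdfs://namenode.example:8020/path/to/spark.tgz"))
```

A "found" result only shows that the command exists; it does not prove that
the file itself is readable from that host.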

Re: Tasks always lost

Posted by qingyang li <li...@gmail.com>.

Here is the slave log:

E0702 10:32:07.599364 14915 slave.cpp:2686] Failed to unmonitor container
for executor 20140616-104524-1694607552-5050-26919-1 of framework
20140702-102939-1694607552-5050-14846-0000: Not monitored

E0702 11:35:08.869998 17840 slave.cpp:2310] Container
'af557235-2d5f-4062-aaf3-a747cb3cd0d1' for executor
'20140616-104524-1694607552-5050-26919-1' of framework
'20140702-113428-1694607552-5050-17766-0000' failed to start: Failed to
fetch URIs for container 'af557235-2d5f-4062-aaf3-a747cb3cd0d1': exit
status 32512
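
A possible reading of "exit status 32512", offered as an assumption rather
than a confirmed diagnosis: if the number is a raw POSIX wait status, the
exit code sits in the high byte and the terminating signal in the low seven
bits. The sketch below decodes it; `decode_wait_status` is a hypothetical
helper name, not part of Mesos.

```python
def decode_wait_status(status: int) -> dict:
    """Split a raw POSIX wait status into exit code and signal number."""
    return {
        "exit_code": (status >> 8) & 0xFF,  # high byte: exit code of the child
        "signal": status & 0x7F,            # low 7 bits: terminating signal, 0 if none
    }


if __name__ == "__main__":
    print(decode_wait_status(32512))  # {'exit_code': 127, 'signal': 0}
```

Exit code 127 is what a shell reports when a command is not found. If that
reading is right, the fetch may be failing because a command it depends on is
missing from the PATH of the slave process, rather than because the file
itself is unreachable.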


