Posted to dev@spark.apache.org by qingyang li <li...@gmail.com> on 2014/07/01 10:25:05 UTC

task always lost

I am using Mesos 0.19 and Spark 0.9.0. The Mesos cluster is started, but when
I use spark-shell to submit a job, the tasks are always lost. Here is the
log:
----------
14/07/01 16:24:27 INFO DAGScheduler: Host gained which was in lost list
earlier: bigdata005
14/07/01 16:24:27 INFO TaskSetManager: Starting task 0.0:1 as TID 4042 on
executor 20140616-143932-1694607552-5050-4080-2: bigdata005 (PROCESS_LOCAL)
14/07/01 16:24:27 INFO TaskSetManager: Serialized task 0.0:1 as 1570 bytes
in 0 ms
14/07/01 16:24:28 INFO TaskSetManager: Re-queueing tasks for
20140616-104524-1694607552-5050-26919-1 from TaskSet 0.0
14/07/01 16:24:28 WARN TaskSetManager: Lost TID 4041 (task 0.0:0)
14/07/01 16:24:28 INFO DAGScheduler: Executor lost:
20140616-104524-1694607552-5050-26919-1 (epoch 3427)
14/07/01 16:24:28 INFO BlockManagerMasterActor: Trying to remove executor
20140616-104524-1694607552-5050-26919-1 from BlockManagerMaster.
14/07/01 16:24:28 INFO BlockManagerMaster: Removed
20140616-104524-1694607552-5050-26919-1 successfully in removeExecutor
14/07/01 16:24:28 INFO TaskSetManager: Re-queueing tasks for
20140616-143932-1694607552-5050-4080-2 from TaskSet 0.0
14/07/01 16:24:28 WARN TaskSetManager: Lost TID 4042 (task 0.0:1)
14/07/01 16:24:28 INFO DAGScheduler: Executor lost:
20140616-143932-1694607552-5050-4080-2 (epoch 3428)
14/07/01 16:24:28 INFO BlockManagerMasterActor: Trying to remove executor
20140616-143932-1694607552-5050-4080-2 from BlockManagerMaster.
14/07/01 16:24:28 INFO BlockManagerMaster: Removed
20140616-143932-1694607552-5050-4080-2 successfully in removeExecutor
14/07/01 16:24:28 INFO DAGScheduler: Host gained which was in lost list
earlier: bigdata005
14/07/01 16:24:28 INFO DAGScheduler: Host gained which was in lost list
earlier: bigdata001
14/07/01 16:24:28 INFO TaskSetManager: Starting task 0.0:1 as TID 4043 on
executor 20140616-143932-1694607552-5050-4080-2: bigdata005 (PROCESS_LOCAL)
14/07/01 16:24:28 INFO TaskSetManager: Serialized task 0.0:1 as 1570 bytes
in 0 ms
14/07/01 16:24:28 INFO TaskSetManager: Starting task 0.0:0 as TID 4044 on
executor 20140616-104524-1694607552-5050-26919-1: bigdata001 (PROCESS_LOCAL)
14/07/01 16:24:28 INFO TaskSetManager: Serialized task 0.0:0 as 1570 bytes
in 0 ms


It seems someone else has encountered the same problem:
http://mail-archives.apache.org/mod_mbox/incubator-mesos-dev/201305.mbox/%3C201305161047069952830@nfs.iscas.ac.cn%3E
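
For context, this is roughly how a Spark 0.9.x application is pointed at a
Mesos cluster; spark-shell reads the equivalent settings from spark-env.sh and
the MASTER environment variable. This is only a sketch of the assumed setup --
the master host, application name, and HDFS path below are placeholders, not
values taken from the report above:

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of a Spark-on-Mesos configuration (placeholder values).
// spark.executor.uri tells each Mesos slave where to download the Spark
// distribution used to run the executor; if that URI cannot be fetched,
// the executor never starts and its tasks are reported as lost.
val conf = new SparkConf()
  .setMaster("mesos://bigdata001:5050")   // illustrative Mesos master
  .setAppName("task-lost-repro")          // illustrative application name
  .set("spark.executor.uri", "hdfs:///path/to/spark-0.9.0.tar.gz") // placeholder

val sc = new SparkContext(conf)
// A trivial job that exercises the executors:
println(sc.parallelize(1 to 1000, 2).count())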

Re: task always lost

Posted by Aaron Davidson <il...@gmail.com>.
The issue you're seeing is not the same as the one you linked to -- your
serialized task sizes are very small, and Mesos fine-grained mode doesn't
use Akka anyway.

The error log you printed seems to be from some sort of Mesos log, but do you
happen to have the logs from the actual executors themselves? Those should be
Spark logs that will hopefully show the actual Exception (or lack thereof)
before the executors die.

The tasks are dying very quickly, so this is probably due either to your
application logic throwing some sort of fatal JVM error or to something in
your Mesos setup. I'm not sure whether that "Failed to fetch URIs for
container" error is fatal or not.
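
To make the fine-grained/coarse-grained distinction above concrete, here is a
minimal sketch, assuming the Spark 0.9 property name spark.mesos.coarse (the
values shown are illustrative, not taken from this thread):

// Fine-grained mode (the default) maps each Spark task to its own Mesos task
// and does not launch tasks through Akka, which is why the Akka frame-size
// problem from the linked thread does not apply here.
val conf = new org.apache.spark.SparkConf()
conf.set("spark.mesos.coarse", "false")  // default: fine-grained Mesos mode
// conf.set("spark.mesos.coarse", "true")  // alternative: one long-lived executor per node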



Re: task always lost

Posted by qingyang li <li...@gmail.com>.
The executors keep being removed.

Someone else encountered the same issue:
https://groups.google.com/forum/#!topic/spark-users/-mYn6BF-Y5Y

-------------
14/07/02 17:41:16 INFO storage.BlockManagerMasterActor: Trying to remove
executor 20140616-104524-1694607552-5050-26919-1 from BlockManagerMaster.
14/07/02 17:41:16 INFO storage.BlockManagerMaster: Removed
20140616-104524-1694607552-5050-26919-1 successfully in removeExecutor
14/07/02 17:41:16 DEBUG spark.MapOutputTrackerMaster: Increasing epoch to 10
14/07/02 17:41:16 INFO scheduler.DAGScheduler: Host gained which was in
lost list earlier: bigdata001
14/07/02 17:41:16 DEBUG scheduler.TaskSchedulerImpl: parentName: , name:
TaskSet_0, runningTasks: 0
14/07/02 17:41:16 DEBUG scheduler.TaskSchedulerImpl: parentName: , name:
TaskSet_0, runningTasks: 0
14/07/02 17:41:16 INFO scheduler.TaskSetManager: Starting task 0.0:0 as TID
12 on executor 20140616-143932-1694607552-5050-4080-3: bigdata004
(NODE_LOCAL)
14/07/02 17:41:16 INFO scheduler.TaskSetManager: Serialized task 0.0:0 as
10785 bytes in 1 ms
14/07/02 17:41:16 INFO scheduler.TaskSetManager: Starting task 0.0:1 as TID
13 on executor 20140616-104524-1694607552-5050-26919-3: bigdata002
(NODE_LOCAL)



Re: task always lost

Posted by qingyang li <li...@gmail.com>.
Also, this appears in the warning log:

E0702 11:35:08.869998 17840 slave.cpp:2310] Container
'af557235-2d5f-4062-aaf3-a747cb3cd0d1' for executor
'20140616-104524-1694607552-5050-26919-1' of framework
'20140702-113428-1694607552-5050-17766-0000' failed to start: Failed to
fetch URIs for container 'af557235-2d5f-4062-aaf3-a747cb3cd0d1': exit
status 32512
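
A note that may help interpret this: the 32512 looks like a raw wait()-style
process status rather than a plain shell exit code, and the real exit code sits
in its high byte -- 127, the shell's "command not found" value. A common cause
on Mesos slaves is the hadoop binary not being on the PATH when the fetcher
tries to download an hdfs:// executor URI, but that is an educated guess rather
than something confirmed in this thread. A minimal sketch of the decoding:

// Decoding a raw wait()-style status such as the 32512 reported above.
val rawStatus = 32512
val exitCode  = (rawStatus >> 8) & 0xff  // the high byte holds the exit code
println(exitCode)                        // prints 127 -> typically "command not found"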



Re: task always lost

Posted by qingyang li <li...@gmail.com>.
Here is the log:

E0702 10:32:07.599364 14915 slave.cpp:2686] Failed to unmonitor container
for executor 20140616-104524-1694607552-5050-26919-1 of framework
20140702-102939-1694607552-5050-14846-0000: Not monitored



Re: task always lost

Posted by Aaron Davidson <il...@gmail.com>.
Can you post the logs from any of the dying executors?

