Posted to user@spark.apache.org by Sébastien Rainville <se...@gmail.com> on 2014/06/17 20:57:55 UTC

Problems running Spark job on Mesos in fine-grained mode

Hi,

I'm having trouble running Spark on Mesos in fine-grained mode. I'm running
Spark 1.0.0 and Mesos 0.18.0. Tasks are failing randomly, which most of the
time, but not always, causes the job to fail. The same code runs fine in
coarse-grained mode. I see the following exceptions in the logs of the
Spark driver:

W0617 10:57:36.774382  8735 sched.cpp:901] Attempting to launch task 21
with an unknown offer 20140416-011500-1369465866-5050-26096-52332715
W0617 10:57:36.774433  8735 sched.cpp:901] Attempting to launch task 22
with an unknown offer 20140416-011500-1369465866-5050-26096-52332715
14/06/17 10:57:36 INFO TaskSetManager: Re-queueing tasks for
201311011608-1369465866-5050-9189-46 from TaskSet 0.0
14/06/17 10:57:36 WARN TaskSetManager: Lost TID 22 (task 0.0:2)
14/06/17 10:57:36 WARN TaskSetManager: Lost TID 19 (task 0.0:0)
14/06/17 10:57:36 WARN TaskSetManager: Lost TID 21 (task 0.0:1)
14/06/17 10:57:36 INFO DAGScheduler: Executor lost:
201311011608-1369465866-5050-9189-46 (epoch 0)
14/06/17 10:57:36 INFO BlockManagerMasterActor: Trying to remove executor
201311011608-1369465866-5050-9189-46 from BlockManagerMaster.
14/06/17 10:57:36 INFO BlockManagerMaster: Removed
201311011608-1369465866-5050-9189-46 successfully in removeExecutor
14/06/17 10:57:36 DEBUG MapOutputTrackerMaster: Increasing epoch to 1
14/06/17 10:57:36 INFO DAGScheduler: Host added was in lost list earlier:
ca1-dcc1-0065.lab.mtl

I don't see any exceptions in the Spark executor logs. The only error
messages I found in Mesos itself are these warnings in the Mesos master:

W0617 10:57:36.816748 26100 master.cpp:1615] Failed to validate task 21 :
Task 21 attempted to use cpus(*):1 combined with already used cpus(*):1;
mem(*):2048 is greater than offered mem(*):3216; disk(*):98304;
ports(*):[11900-11919, 11921-11995, 11997-11999]; cpus(*):1
W0617 10:57:36.819807 26100 master.cpp:1615] Failed to validate task 22 :
Task 22 attempted to use cpus(*):1 combined with already used cpus(*):1;
mem(*):2048 is greater than offered mem(*):3216; disk(*):98304;
ports(*):[11900-11919, 11921-11995, 11997-11999]; cpus(*):1
W0617 10:57:36.932287 26102 master.cpp:1615] Failed to validate task 28 :
Task 28 attempted to use cpus(*):1 combined with already used cpus(*):1;
mem(*):2048 is greater than offered cpus(*):1; mem(*):3216; disk(*):98304;
ports(*):[11900-11960, 11962-11978, 11980-11999]
W0617 11:05:52.783133 26098 master.cpp:2106] Ignoring unknown exited
executor 201311011608-1369465866-5050-9189-46 on slave
201311011608-1369465866-5050-9189-46 (ca1-dcc1-0065.lab.mtl)
W0617 11:05:52.787739 26103 master.cpp:2106] Ignoring unknown exited
executor 201311011608-1369465866-5050-9189-34 on slave
201311011608-1369465866-5050-9189-34 (ca1-dcc1-0053.lab.mtl)
W0617 11:05:52.790292 26102 master.cpp:2106] Ignoring unknown exited
executor 201311011608-1369465866-5050-9189-59 on slave
201311011608-1369465866-5050-9189-59 (ca1-dcc1-0079.lab.mtl)
W0617 11:05:52.800649 26099 master.cpp:2106] Ignoring unknown exited
executor 201311011608-1369465866-5050-9189-18 on slave
201311011608-1369465866-5050-9189-18 (ca1-dcc1-0027.lab.mtl)
... (more of those "Ignoring unknown exited executor")
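
Reading the first warning (assuming I'm parsing the Mesos 0.18 message
format correctly, i.e. attempted resources, combined with resources already
in use from the offer, versus offered resources): the offer held cpus(*):1
and mem(*):3216, one CPU was already used from that same offer, and task 21
asked for one more, pushing the combined usage past the offer. In other
words, the driver appears to be launching more than one task against the
same single-CPU offer, as if tasks are being paired with the wrong or stale
offers.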


I analyzed the difference between executions of the same job in
coarse-grained mode and fine-grained mode, and I noticed that in
fine-grained mode the tasks get executed on executors different from the
ones reported in Spark, as if Spark and Mesos get out of sync about which
executor is responsible for which task. See the following:


Coarse-grained mode:

Spark                                   | Mesos
Task Index  Task ID  Executor  Status   | Task ID (UI)  Task Name      Task ID (logs)  Executor  State
0           0        66        SUCCESS  | 4             "Task 4"       0               66        RUNNING
1           1        59        SUCCESS  | 0             "Task 0"       1               59        RUNNING
2           2        54        SUCCESS  | 10            "Task 10"      2               54        RUNNING
3           3        128       SUCCESS  | 6             "Task 6"       3               128       RUNNING
...

Fine-grained mode:

Spark                                   | Mesos
Task Index  Task ID  Executor  Status   | Task ID (UI)  Task Name      Task ID (logs)  Executor  State
0           23       108       SUCCESS  | 23            "task 0.0:0"   23              27        FINISHED
0           19       65        FAILED   | 19            "task 0.0:0"   19              86        FINISHED
1           21       65        FAILED   | Mesos executor was never created
1           24       92        SUCCESS  | 24            "task 0.0:1"   24              129       FINISHED
2           22       65        FAILED   | Mesos executor was never created
2           25       100       SUCCESS  | 25            "task 0.0:2"   25              84        FINISHED
3           26       80        SUCCESS  | 26            "task 0.0:3"   26              124       FINISHED
4           27       65        FAILED   | 27            "task 0.0:4"   27              108       FINISHED
4           29       92        SUCCESS  | 29            "task 0.0:4"   29              65        FINISHED
5           28       65        FAILED   | Mesos executor was never created
5           30       77        SUCCESS  | 30            "task 0.0:5"   30              62        FINISHED
6           0        53        SUCCESS  | 0             "task 0.0:6"   0               41        FINISHED
7           1        77        SUCCESS  | 1             "task 0.0:7"   1               114       FINISHED
...


Is it normal for the executor reported in Spark and the one reported in
Mesos to be different when running in fine-grained mode?

Please note that in this particular example the job actually succeeded,
but most of the time it fails after 4 failed attempts of a given task.
This job never fails in coarse-grained mode. Every job works in
coarse-grained mode and fails in the same way in fine-grained mode.
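
For reference, toggling between the two modes looks like this, as a
minimal sketch assuming the standard Spark 1.0 settings
(spark.mesos.coarse defaults to false, i.e. fine-grained, and
spark.task.maxFailures defaults to 4, which matches the job dying after 4
failed attempts of the same task):

import org.apache.spark.{SparkConf, SparkContext}

// The master URL below is a placeholder, not our actual cluster address.
val conf = new SparkConf()
  .setAppName("mesos-finegrained-repro")
  .setMaster("mesos://host:5050")
  .set("spark.mesos.coarse", "false")   // fine-grained: tasks fail randomly
  // .set("spark.mesos.coarse", "true") // coarse-grained: same job works

val sc = new SparkContext(conf)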

Does anybody have an idea what the problem could be?

Thanks,

- Sebastien

Re: Problems running Spark job on Mesos in fine-grained mode

Posted by Sébastien Rainville <se...@gmail.com>.
Hi Mayur,

I primarily use Scala, but I tested with PySpark too, and it's also
working fine after the patch.

Thanks,

- Sebastien

Re: Problems running Spark job on Mesos in fine-grained mode

Posted by Mayur Rustagi <ma...@gmail.com>.
Hi Sebastien,
Are you using PySpark by any chance? Is it working for you (post the
patch)?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>

Re: Problems running Spark job on Mesos in fine-grained mode

Posted by Fedechicco <fe...@gmail.com>.
I'm getting the same behavior, and it's crucial that I get it fixed for an
evaluation of Spark + Mesos within my company.

I'm adding a +1 to the request to put this fix in 1.0.1 if possible!

thanks,
Federico

Re: Problems running Spark job on Mesos in fine-grained mode

Posted by Sébastien Rainville <se...@gmail.com>.
Hi,

This is just a follow-up regarding this issue. It turns out that it's
caused by a bug in Spark. I created a case for it,
https://issues.apache.org/jira/browse/SPARK-2204, and submitted a patch.
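
In case it helps anyone reading along, the gist of the problem as I
understand it (see the JIRA for the real details and the diff) is that
task lists were being paired with offers by position even though the two
can end up in different orders; resolving each task's offer through its
slave ID is order-independent. A hypothetical before/after sketch, not
the actual patch:

object FixSketch {
  case class Offer(id: String, slaveId: String)
  case class TaskDesc(tid: Int, slaveId: String)

  def launch(offer: Offer, task: TaskDesc): Unit =
    println(s"launching task ${task.tid} with offer ${offer.id} on ${offer.slaveId}")

  def launchAll(offers: Seq[Offer], taskLists: Seq[Seq[TaskDesc]]): Unit = {
    // Before (buggy): assume taskLists(i) corresponds to offers(i), which
    // breaks as soon as anything reorders one side:
    //   for ((tasks, i) <- taskLists.zipWithIndex; t <- tasks)
    //     launch(offers(i), t)

    // After (fixed): resolve each task's offer through its slave ID.
    val offerBySlave = offers.map(o => o.slaveId -> o).toMap
    for (tasks <- taskLists; t <- tasks)
      launch(offerBySlave(t.slaveId), t)
  }
}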

Any chance this could be included in the 1.0.1 release?

Thanks,

- Sebastien