Posted to user@spark.apache.org by Ji Yan <ji...@drive.ai> on 2017/12/06 06:45:24 UTC

Spark job only starts tasks on a single node

Hi all,

I am running Spark 2.0 on Mesos 1.1. I was trying to split my job across
several nodes. I set the number of executors via the formula
(spark.cores.max / spark.executor.cores). The behavior I saw was that Spark
fills up one Mesos node with as many executors as it can, then stops going
to other Mesos nodes even though it has not yet scheduled all the executors
I asked for! This is super weird!
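
For illustration, a minimal sketch of the kind of configuration in play (the
app name and the specific numbers here are made up, not my actual job). With
these values I would expect roughly 16 / 4 = 4 executors, spread across nodes:

  import org.apache.spark.{SparkConf, SparkContext}

  // Sketch only: request 16 cores in total, 4 per executor, so the Mesos
  // scheduler should end up launching about 4 executors.
  val conf = new SparkConf()
    .setAppName("split-across-nodes-test")   // placeholder name
    .set("spark.cores.max", "16")            // total cores for the job
    .set("spark.executor.cores", "4")        // cores per executor
    .set("spark.executor.memory", "4g")      // memory per executor
  val sc = new SparkContext(conf)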

Did anyone notice this behavior before? Any help appreciated!

Ji


Re: Spark job only starts tasks on a single node

Posted by Ji Yan <ji...@drive.ai>.
This used to work. The only thing that has changed is that the Mesos
installed in the Spark executor image is a different version from before. My
Spark executor runs in a container whose image has Mesos installed, and that
Mesos version is actually different from the Mesos master's version. I am not
sure if that is the problem, though. I am trying to bring the old Mesos
version back into the Spark executor image. Does anyone know whether a Mesos
agent and master running different versions could lead to this problem?
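
For context, a rough sketch of how the executor container is wired up on my
side (the image name and library path below are placeholders, not my real
setup):

  import org.apache.spark.SparkConf

  // Sketch only -- image name and libmesos path are placeholders.
  val conf = new SparkConf()
    .set("spark.mesos.executor.docker.image",
         "registry.example.com/spark-executor:latest")   // image with Mesos libs baked in
    .set("spark.executorEnv.MESOS_NATIVE_JAVA_LIBRARY",
         "/usr/lib/libmesos.so")                          // libmesos inside that image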

On Thu, Dec 7, 2017 at 11:34 AM, Art Rand <ar...@gmail.com> wrote:

> Sounds a little like the driver got one offer when it was using zero
> resources and then isn't getting any more. How many frameworks (and which
> ones) are running on the cluster? The Mesos Master log should say which
> frameworks are getting offers, and that should help diagnose the problem.
>
> A
>
> On Thu, Dec 7, 2017 at 10:18 AM, Susan X. Huynh <xh...@mesosphere.io>
> wrote:
>
>> Sounds strange. Maybe it has to do with the job itself? What kind of job
>> is it? Have you gotten it to run on more than one node before? What's in
>> the spark-submit command?
>>
>> Susan
>>
>> On Wed, Dec 6, 2017 at 11:21 AM, Ji Yan <ji...@drive.ai> wrote:
>>
>>> I am sure that the other agents have plenty of resources, but I don't
>>> know why Spark only scheduled executors on one single node, up to that
>>> node's capacity (it is a different node every time I run, by the way).
>>>
>>> I checked the DEBUG log from the Spark Driver and didn't see any mention
>>> of declines. From the log, it looks like it has only accepted one offer
>>> from Mesos.
>>>
>>> Also, it looks like no special role is required on the Spark side!
>>>
>>> On Wed, Dec 6, 2017 at 5:57 AM, Art Rand <ar...@gmail.com> wrote:
>>>
>>>> Hello Ji,
>>>>
>>>> Spark launches Executors round-robin on offers, so when the resources
>>>> on an agent get broken into multiple resource offers it's possible that
>>>> many Executors get placed on a single agent. However, from your
>>>> description, it's not clear why your other agents do not get Executors
>>>> scheduled on them. It's possible that the offers from your other agents
>>>> are insufficient in some way. The Mesos MASTER log should show offers
>>>> being declined by your Spark Driver; do you see that? If you have
>>>> DEBUG-level logging in your Spark driver you should also see offers
>>>> being declined
>>>> <https://github.com/apache/spark/blob/193555f79cc73873613674a09a7c371688b6dbc7/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L576>
>>>> there. Finally, if your Spark framework isn't receiving any resource
>>>> offers, it could be because of the roles you have established on your
>>>> agents or quotas set for other frameworks; have you set up any of that?
>>>> Hope this helps!
>>>>
>>>> Art
>>>>
>>>> On Tue, Dec 5, 2017 at 10:45 PM, Ji Yan <ji...@drive.ai> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am running Spark 2.0 on Mesos 1.1. I was trying to split my job
>>>>> across several nodes. I set the number of executors via the formula
>>>>> (spark.cores.max / spark.executor.cores). The behavior I saw was that
>>>>> Spark fills up one Mesos node with as many executors as it can, then
>>>>> stops going to other Mesos nodes even though it has not yet scheduled
>>>>> all the executors I asked for! This is super weird!
>>>>>
>>>>> Did anyone notice this behavior before? Any help appreciated!
>>>>>
>>>>> Ji
>>>>>
>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Susan X. Huynh
>> Software engineer, Data Agility
>> xhuynh@mesosphere.com
>>
>
>


Re: Spark job only starts tasks on a single node

Posted by Art Rand <ar...@gmail.com>.
Sounds a little like the driver got one offer when it was using zero
resources and then isn't getting any more. How many frameworks (and which
ones) are running on the cluster? The Mesos Master log should say which
frameworks are getting offers, and that should help diagnose the problem.

A

On Thu, Dec 7, 2017 at 10:18 AM, Susan X. Huynh <xh...@mesosphere.io>
wrote:

> Sounds strange. Maybe it has to do with the job itself? What kind of job
> is it? Have you gotten it to run on more than one node before? What's in
> the spark-submit command?
>
> Susan
>
> On Wed, Dec 6, 2017 at 11:21 AM, Ji Yan <ji...@drive.ai> wrote:
>
>> I am sure that the other agents have plenty of resources, but I don't
>> know why Spark only scheduled executors on one single node, up to that
>> node's capacity (it is a different node every time I run, by the way).
>>
>> I checked the DEBUG log from the Spark Driver and didn't see any mention
>> of declines. From the log, it looks like it has only accepted one offer
>> from Mesos.
>>
>> Also, it looks like no special role is required on the Spark side!
>>
>> On Wed, Dec 6, 2017 at 5:57 AM, Art Rand <ar...@gmail.com> wrote:
>>
>>> Hello Ji,
>>>
>>> Spark launches Executors round-robin on offers, so when the resources on
>>> an agent get broken into multiple resource offers it's possible that many
>>> Executors get placed on a single agent. However, from your description,
>>> it's not clear why your other agents do not get Executors scheduled on
>>> them. It's possible that the offers from your other agents are
>>> insufficient in some way. The Mesos MASTER log should show offers being
>>> declined by your Spark Driver; do you see that? If you have DEBUG-level
>>> logging in your Spark driver you should also see offers being declined
>>> <https://github.com/apache/spark/blob/193555f79cc73873613674a09a7c371688b6dbc7/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L576>
>>> there. Finally, if your Spark framework isn't receiving any resource
>>> offers, it could be because of the roles you have established on your
>>> agents or quotas set for other frameworks; have you set up any of that?
>>> Hope this helps!
>>>
>>> Art
>>>
>>> On Tue, Dec 5, 2017 at 10:45 PM, Ji Yan <ji...@drive.ai> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am running Spark 2.0 on Mesos 1.1. I was trying to split my job
>>>> across several nodes. I set the number of executors via the formula
>>>> (spark.cores.max / spark.executor.cores). The behavior I saw was that
>>>> Spark fills up one Mesos node with as many executors as it can, then
>>>> stops going to other Mesos nodes even though it has not yet scheduled
>>>> all the executors I asked for! This is super weird!
>>>>
>>>> Did anyone notice this behavior before? Any help appreciated!
>>>>
>>>> Ji
>>>>
>>>
>>>
>>
>
>
>
> --
> Susan X. Huynh
> Software engineer, Data Agility
> xhuynh@mesosphere.com
>

Re: Spark job only starts tasks on a single node

Posted by "Susan X. Huynh" <xh...@mesosphere.io>.
Sounds strange. Maybe it has to do with the job itself? What kind of job is
it? Have you gotten it to run on more than one node before? What's in the
spark-submit command?

Susan

On Wed, Dec 6, 2017 at 11:21 AM, Ji Yan <ji...@drive.ai> wrote:

> I am sure that the other agents have plenty of resources, but I don't
> know why Spark only scheduled executors on one single node, up to that
> node's capacity (it is a different node every time I run, by the way).
>
> I checked the DEBUG log from the Spark Driver and didn't see any mention
> of declines. From the log, it looks like it has only accepted one offer
> from Mesos.
>
> Also, it looks like no special role is required on the Spark side!
>
> On Wed, Dec 6, 2017 at 5:57 AM, Art Rand <ar...@gmail.com> wrote:
>
>> Hello Ji,
>>
>> Spark launches Executors round-robin on offers, so when the resources on
>> an agent get broken into multiple resource offers it's possible that many
>> Executors get placed on a single agent. However, from your description,
>> it's not clear why your other agents do not get Executors scheduled on
>> them. It's possible that the offers from your other agents are
>> insufficient in some way. The Mesos MASTER log should show offers being
>> declined by your Spark Driver; do you see that? If you have DEBUG-level
>> logging in your Spark driver you should also see offers being declined
>> <https://github.com/apache/spark/blob/193555f79cc73873613674a09a7c371688b6dbc7/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L576>
>> there. Finally, if your Spark framework isn't receiving any resource
>> offers, it could be because of the roles you have established on your
>> agents or quotas set for other frameworks; have you set up any of that?
>> Hope this helps!
>>
>> Art
>>
>> On Tue, Dec 5, 2017 at 10:45 PM, Ji Yan <ji...@drive.ai> wrote:
>>
>>> Hi all,
>>>
>>> I am running Spark 2.0 on Mesos 1.1. I was trying to split my job
>>> across several nodes. I set the number of executors via the formula
>>> (spark.cores.max / spark.executor.cores). The behavior I saw was that
>>> Spark fills up one Mesos node with as many executors as it can, then
>>> stops going to other Mesos nodes even though it has not yet scheduled
>>> all the executors I asked for! This is super weird!
>>>
>>> Did anyone notice this behavior before? Any help appreciated!
>>>
>>> Ji
>>>
>>
>>
>



-- 
Susan X. Huynh
Software engineer, Data Agility
xhuynh@mesosphere.com

Re: Spark job only starts tasks on a single node

Posted by Ji Yan <ji...@drive.ai>.
I am sure that the other agents have plenty of resources, but I don't know
why Spark only scheduled executors on one single node, up to that node's
capacity (it is a different node every time I run, by the way).

I checked the DEBUG log from the Spark Driver and didn't see any mention of
declines. From the log, it looks like it has only accepted one offer from
Mesos.

Also, it looks like no special role is required on the Spark side!
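
For what it's worth, the only place a role would come in on the Spark side is
a setting like the one sketched below, which I am not using (the role name is
just a placeholder for illustration):

  import org.apache.spark.SparkConf

  // Not set in my job; shown only to illustrate where a Mesos role would be
  // configured so the framework registers under that role.
  val conf = new SparkConf().set("spark.mesos.role", "some-role")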

On Wed, Dec 6, 2017 at 5:57 AM, Art Rand <ar...@gmail.com> wrote:

> Hello Ji,
>
> Spark launches Executors round-robin on offers, so when the resources on
> an agent get broken into multiple resource offers it's possible that many
> Executors get placed on a single agent. However, from your description,
> it's not clear why your other agents do not get Executors scheduled on
> them. It's possible that the offers from your other agents are
> insufficient in some way. The Mesos MASTER log should show offers being
> declined by your Spark Driver; do you see that? If you have DEBUG-level
> logging in your Spark driver you should also see offers being declined
> <https://github.com/apache/spark/blob/193555f79cc73873613674a09a7c371688b6dbc7/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L576>
> there. Finally, if your Spark framework isn't receiving any resource
> offers, it could be because of the roles you have established on your
> agents or quotas set for other frameworks; have you set up any of that?
> Hope this helps!
>
> Art
>
> On Tue, Dec 5, 2017 at 10:45 PM, Ji Yan <ji...@drive.ai> wrote:
>
>> Hi all,
>>
>> I am running Spark 2.0 on Mesos 1.1. I was trying to split my job
>> across several nodes. I set the number of executors via the formula
>> (spark.cores.max / spark.executor.cores). The behavior I saw was that
>> Spark fills up one Mesos node with as many executors as it can, then
>> stops going to other Mesos nodes even though it has not yet scheduled
>> all the executors I asked for! This is super weird!
>>
>> Did anyone notice this behavior before? Any help appreciated!
>>
>> Ji
>>
>
>


Re: Spark job only starts tasks on a single node

Posted by Art Rand <ar...@gmail.com>.
Hello Ji,

Spark launches Executors round-robin on offers, so when the resources on an
agent get broken into multiple resource offers it's possible that many
Executors get placed on a single agent. However, from your description, it's
not clear why your other agents do not get Executors scheduled on them. It's
possible that the offers from your other agents are insufficient in some way.
The Mesos MASTER log should show offers being declined by your Spark Driver;
do you see that? If you have DEBUG-level logging in your Spark driver you
should also see offers being declined
<https://github.com/apache/spark/blob/193555f79cc73873613674a09a7c371688b6dbc7/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L576>
there. Finally, if your Spark framework isn't receiving any resource offers,
it could be because of the roles you have established on your agents or
quotas set for other frameworks; have you set up any of that? Hope this
helps!
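
One way to get that DEBUG output (a sketch only, assuming the stock log4j
setup that ships with Spark) is to raise the level for the Mesos scheduler
package in the driver, e.g. from the application itself:

  import org.apache.log4j.{Level, Logger}

  // Turn on DEBUG for the Mesos coarse-grained scheduler classes so the
  // driver logs each offer it declines (package name taken from the Spark
  // source link above).
  Logger.getLogger("org.apache.spark.scheduler.cluster.mesos").setLevel(Level.DEBUG)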

Art

On Tue, Dec 5, 2017 at 10:45 PM, Ji Yan <ji...@drive.ai> wrote:

> Hi all,
>
> I am running Spark 2.0 on Mesos 1.1. I was trying to split my job
> across several nodes. I set the number of executors via the formula
> (spark.cores.max / spark.executor.cores). The behavior I saw was that
> Spark fills up one Mesos node with as many executors as it can, then
> stops going to other Mesos nodes even though it has not yet scheduled
> all the executors I asked for! This is super weird!
>
> Did anyone notice this behavior before? Any help appreciated!
>
> Ji
>