You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Josh Devins <jo...@soundcloud.com> on 2014/12/22 17:23:57 UTC

Mesos resource allocation

We are experimenting with running Spark on Mesos after running
successfully in Standalone mode for a few months. With the Standalone
resource manager (as well as YARN), you have the option to define the
number of cores, number of executors and memory per executor. In
Mesos, however, it appears as though you cannot specify the number of
executors, even in coarse-grained mode. If this is the case, how do
you define the number of executors to run with?

Here's an example of why this matters (to us). Let's say we have the
following cluster:

num nodes: 8
num cores: 256 (32 per node)
total memory: 512GB (64GB per node)

If I set my job to require 256 cores and per-executor-memory to 30GB,
then Mesos will schedule a single executor per machine (8 executors
total) and each executor will get 32 cores to work with. This means
that we have 8 executors * 32GB each for a total of 240G of cluster
memory in use — less than half of what is available. If you want
actually 16 executors in order to increase the amount of memory in use
across the cluster, how can you do this with Mesos? It seems that a
parameter is missing (or I haven't found it yet) which lets me tune
this for Mesos:
 * number of executors per n-cores OR
 * number of executors total

Furthermore, in fine-grained mode in Mesos, how are the executors
started/allocated? That is, since Spark tasks map to Mesos tasks, when
and how are executors started? If they are transient and an executor
per task is created, does this mean we cannot have cached RDDs?

Thanks for any advice or pointers,

Josh

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: Mesos resource allocation

Posted by Josh Devins <jo...@soundcloud.com>.

Yes, there is a thread on the Spark list to continue this topic and as you
mention, it's up to Spark to define this.

Josh
 On 5 Jan 2015 21:29, "Benjamin Mahler" <be...@gmail.com> wrote:

> Did you find what you were looking for?
>
> At a quick glance, this kind of configuration is left to the framework
> (Spark).
> Mesos doesn't make any decisions with respect to how the resources offered
> to Spark are being used.
>
> On Tue, Dec 23, 2014 at 5:44 AM, Josh Devins <jo...@soundcloud.com> wrote:
>
>> Cross-posting to see if someone with Mesos experience can help with
>> this Spark-Mesos question (below).
>>
>>
>> ---------- Forwarded message ----------
>> From: Josh Devins <jo...@soundcloud.com>
>> Date: 22 December 2014 at 17:23
>> Subject: Mesos resource allocation
>> To: user@spark.apache.org
>>
>>
>> We are experimenting with running Spark on Mesos after running
>> successfully in Standalone mode for a few months. With the Standalone
>> resource manager (as well as YARN), you have the option to define the
>> number of cores, number of executors and memory per executor. In
>> Mesos, however, it appears as though you cannot specify the number of
>> executors, even in coarse-grained mode. If this is the case, how do
>> you define the number of executors to run with?
>>
>> Here's an example of why this matters (to us). Let's say we have the
>> following cluster:
>>
>> num nodes: 8
>> num cores: 256 (32 per node)
>> total memory: 512GB (64GB per node)
>>
>> If I set my job to require 256 cores and per-executor-memory to 30GB,
>> then Mesos will schedule a single executor per machine (8 executors
>> total) and each executor will get 32 cores to work with. This means
>> that we have 8 executors * 32GB each for a total of 240G of cluster
>> memory in use — less than half of what is available. If you want
>> actually 16 executors in order to increase the amount of memory in use
>> across the cluster, how can you do this with Mesos? It seems that a
>> parameter is missing (or I haven't found it yet) which lets me tune
>> this for Mesos:
>>  * number of executors per n-cores OR
>>  * number of executors total
>>
>> Furthermore, in fine-grained mode in Mesos, how are the executors
>> started/allocated? That is, since Spark tasks map to Mesos tasks, when
>> and how are executors started? If they are transient and an executor
>> per task is created, does this mean we cannot have cached RDDs?
>>
>> Thanks for any advice or pointers,
>>
>> Josh
>>
>
>

Re: Mesos resource allocation

Posted by Benjamin Mahler <be...@gmail.com>.

Did you find what you were looking for?

At a quick glance, this kind of configuration is left to the framework
(Spark).
Mesos doesn't make any decisions with respect to how the resources offered
to Spark are being used.

On Tue, Dec 23, 2014 at 5:44 AM, Josh Devins <jo...@soundcloud.com> wrote:

> Cross-posting to see if someone with Mesos experience can help with
> this Spark-Mesos question (below).
>
>
> ---------- Forwarded message ----------
> From: Josh Devins <jo...@soundcloud.com>
> Date: 22 December 2014 at 17:23
> Subject: Mesos resource allocation
> To: user@spark.apache.org
>
>
> We are experimenting with running Spark on Mesos after running
> successfully in Standalone mode for a few months. With the Standalone
> resource manager (as well as YARN), you have the option to define the
> number of cores, number of executors and memory per executor. In
> Mesos, however, it appears as though you cannot specify the number of
> executors, even in coarse-grained mode. If this is the case, how do
> you define the number of executors to run with?
>
> Here's an example of why this matters (to us). Let's say we have the
> following cluster:
>
> num nodes: 8
> num cores: 256 (32 per node)
> total memory: 512GB (64GB per node)
>
> If I set my job to require 256 cores and per-executor-memory to 30GB,
> then Mesos will schedule a single executor per machine (8 executors
> total) and each executor will get 32 cores to work with. This means
> that we have 8 executors * 32GB each for a total of 240G of cluster
> memory in use — less than half of what is available. If you want
> actually 16 executors in order to increase the amount of memory in use
> across the cluster, how can you do this with Mesos? It seems that a
> parameter is missing (or I haven't found it yet) which lets me tune
> this for Mesos:
>  * number of executors per n-cores OR
>  * number of executors total
>
> Furthermore, in fine-grained mode in Mesos, how are the executors
> started/allocated? That is, since Spark tasks map to Mesos tasks, when
> and how are executors started? If they are transient and an executor
> per task is created, does this mean we cannot have cached RDDs?
>
> Thanks for any advice or pointers,
>
> Josh
>

Re: Mesos resource allocation

Posted by Tim Chen <ti...@mesosphere.io>.

Hi Josh,

I see, I haven't heard folks using larger JVM heap size than you mentioned
(30gb), but in your scenario what you're proposing does make sense.

I've created SPARK-5095 and we can continue our discussion about how to
address this.

Tim


On Mon, Jan 5, 2015 at 1:22 AM, Josh Devins <jo...@soundcloud.com> wrote:

> Hey Tim, sorry for the delayed reply, been on vacation for a couple weeks.
>
> The reason we want to control the number of executors is that running
> executors with JVM heaps over 30GB causes significant garbage
> collection problems. We have observed this through much
> trial-and-error for jobs that are a dozen-or-so stages, running for
> more than ~20m. For example, if we run 8 executors with 60GB heap each
> (for example, we have also other values larger than 30GB), even after
> much tuning of heap parameters (% for RDD cache, etc.) we run into GC
> problems. Effectively GC becomes so high that it takes over all
> compute time from the JVM. If we then halve the heap (30GB) but double
> the number of executors (16), all GC problems are relieved and we get
> to use the full memory resources of the cluster. We talked with some
> engineers from Databricks at Strata in Barcelona recently and received
> the same advice — do not run executors with more than 30GB heaps.
> Since our machines are 64GB machines and we are typically only running
> one or two jobs at a time on the cluster (for now), we can only use
> half the cluster memory with the current configuration options
> available in Mesos.
>
> Happy to hear your thoughts and actually very curious about how others
> are running Spark on Mesos with large heaps (as a result of large
> memory machines). Perhaps this is a non-issue when we have more
> multi-tenancy in the cluster, but for now, this is not the case.
>
> Thanks,
>
> Josh
>
>
> On 24 December 2014 at 06:22, Tim Chen <ti...@mesosphere.io> wrote:
> >
> > Hi Josh,
> >
> > If you want to cap the amount of memory per executor in Coarse grain
> mode, then yes you only get 240GB of memory as you mentioned. What's the
> reason you don't want to raise the capacity of memory you use per executor?
> >
> > In coarse grain mode the Spark executor is long living and it internally
> will get tasks distributed by Spark internal Coarse grained scheduler. I
> think the assumption is that it already allocated the maximum available on
> that slave and don't really assume we need another one.
> >
> > I think it's worth considering having a configuration of number of cores
> per executor, especially when Mesos have inverse offers and optimistic
> offers so we can choose to launch more executors when resources becomes
> available even in coarse grain mode and then support giving the executors
> back but more higher priority tasks arrive.
> >
> > For fine grain mode, the spark executors are started by Mesos executors
> that is configured from Mesos scheduler backend. I believe the RDD is
> cached as long as the Mesos executor is running as the BlockManager is
> created on executor registration.
> >
> > Let me know if you need any more info.
> >
> > Tim
> >
> >
> >>
> >> ---------- Forwarded message ----------
> >> From: Josh Devins <jo...@soundcloud.com>
> >> Date: 22 December 2014 at 17:23
> >> Subject: Mesos resource allocation
> >> To: user@spark.apache.org
> >>
> >>
> >> We are experimenting with running Spark on Mesos after running
> >> successfully in Standalone mode for a few months. With the Standalone
> >> resource manager (as well as YARN), you have the option to define the
> >> number of cores, number of executors and memory per executor. In
> >> Mesos, however, it appears as though you cannot specify the number of
> >> executors, even in coarse-grained mode. If this is the case, how do
> >> you define the number of executors to run with?
> >>
> >> Here's an example of why this matters (to us). Let's say we have the
> >> following cluster:
> >>
> >> num nodes: 8
> >> num cores: 256 (32 per node)
> >> total memory: 512GB (64GB per node)
> >>
> >> If I set my job to require 256 cores and per-executor-memory to 30GB,
> >> then Mesos will schedule a single executor per machine (8 executors
> >> total) and each executor will get 32 cores to work with. This means
> >> that we have 8 executors * 32GB each for a total of 240G of cluster
> >> memory in use — less than half of what is available. If you want
> >> actually 16 executors in order to increase the amount of memory in use
> >> across the cluster, how can you do this with Mesos? It seems that a
> >> parameter is missing (or I haven't found it yet) which lets me tune
> >> this for Mesos:
> >>  * number of executors per n-cores OR
> >>  * number of executors total
> >>
> >> Furthermore, in fine-grained mode in Mesos, how are the executors
> >> started/allocated? That is, since Spark tasks map to Mesos tasks, when
> >> and how are executors started? If they are transient and an executor
> >> per task is created, does this mean we cannot have cached RDDs?
> >>
> >> Thanks for any advice or pointers,
> >>
> >> Josh
> >
> >
> >
>

Re: Mesos resource allocation

Posted by Josh Devins <jo...@soundcloud.com>.

Hey Tim, sorry for the delayed reply, been on vacation for a couple weeks.

The reason we want to control the number of executors is that running
executors with JVM heaps over 30GB causes significant garbage
collection problems. We have observed this through much
trial-and-error for jobs that are a dozen-or-so stages, running for
more than ~20m. For example, if we run 8 executors with 60GB heap each
(for example, we have also other values larger than 30GB), even after
much tuning of heap parameters (% for RDD cache, etc.) we run into GC
problems. Effectively GC becomes so high that it takes over all
compute time from the JVM. If we then halve the heap (30GB) but double
the number of executors (16), all GC problems are relieved and we get
to use the full memory resources of the cluster. We talked with some
engineers from Databricks at Strata in Barcelona recently and received
the same advice — do not run executors with more than 30GB heaps.
Since our machines are 64GB machines and we are typically only running
one or two jobs at a time on the cluster (for now), we can only use
half the cluster memory with the current configuration options
available in Mesos.

Happy to hear your thoughts and actually very curious about how others
are running Spark on Mesos with large heaps (as a result of large
memory machines). Perhaps this is a non-issue when we have more
multi-tenancy in the cluster, but for now, this is not the case.

Thanks,

Josh

On 24 December 2014 at 06:22, Tim Chen <ti...@mesosphere.io> wrote:
>
> Hi Josh,
>
> If you want to cap the amount of memory per executor in Coarse grain mode, then yes you only get 240GB of memory as you mentioned. What's the reason you don't want to raise the capacity of memory you use per executor?
>
> In coarse grain mode the Spark executor is long living and it internally will get tasks distributed by Spark internal Coarse grained scheduler. I think the assumption is that it already allocated the maximum available on that slave and don't really assume we need another one.
>
> I think it's worth considering having a configuration of number of cores per executor, especially when Mesos have inverse offers and optimistic offers so we can choose to launch more executors when resources becomes available even in coarse grain mode and then support giving the executors back but more higher priority tasks arrive.
>
> For fine grain mode, the spark executors are started by Mesos executors that is configured from Mesos scheduler backend. I believe the RDD is cached as long as the Mesos executor is running as the BlockManager is created on executor registration.
>
> Let me know if you need any more info.
>
> Tim
>
>
>>
>> ---------- Forwarded message ----------
>> From: Josh Devins <jo...@soundcloud.com>
>> Date: 22 December 2014 at 17:23
>> Subject: Mesos resource allocation
>> To: user@spark.apache.org
>>
>>
>> We are experimenting with running Spark on Mesos after running
>> successfully in Standalone mode for a few months. With the Standalone
>> resource manager (as well as YARN), you have the option to define the
>> number of cores, number of executors and memory per executor. In
>> Mesos, however, it appears as though you cannot specify the number of
>> executors, even in coarse-grained mode. If this is the case, how do
>> you define the number of executors to run with?
>>
>> Here's an example of why this matters (to us). Let's say we have the
>> following cluster:
>>
>> num nodes: 8
>> num cores: 256 (32 per node)
>> total memory: 512GB (64GB per node)
>>
>> If I set my job to require 256 cores and per-executor-memory to 30GB,
>> then Mesos will schedule a single executor per machine (8 executors
>> total) and each executor will get 32 cores to work with. This means
>> that we have 8 executors * 32GB each for a total of 240G of cluster
>> memory in use — less than half of what is available. If you want
>> actually 16 executors in order to increase the amount of memory in use
>> across the cluster, how can you do this with Mesos? It seems that a
>> parameter is missing (or I haven't found it yet) which lets me tune
>> this for Mesos:
>>  * number of executors per n-cores OR
>>  * number of executors total
>>
>> Furthermore, in fine-grained mode in Mesos, how are the executors
>> started/allocated? That is, since Spark tasks map to Mesos tasks, when
>> and how are executors started? If they are transient and an executor
>> per task is created, does this mean we cannot have cached RDDs?
>>
>> Thanks for any advice or pointers,
>>
>> Josh
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Fwd: Mesos resource allocation

Posted by Tim Chen <ti...@mesosphere.io>.

Hi Josh,

If you want to cap the amount of memory per executor in Coarse grain mode,
then yes you only get 240GB of memory as you mentioned. What's the reason
you don't want to raise the capacity of memory you use per executor?

In coarse grain mode the Spark executor is long living and it internally
will get tasks distributed by Spark internal Coarse grained scheduler. I
think the assumption is that it already allocated the maximum available on
that slave and don't really assume we need another one.

I think it's worth considering having a configuration of number of cores
per executor, especially when Mesos have inverse offers and optimistic
offers so we can choose to launch more executors when resources becomes
available even in coarse grain mode and then support giving the executors
back but more higher priority tasks arrive.

For fine grain mode, the spark executors are started by Mesos executors
that is configured from Mesos scheduler backend. I believe the RDD is
cached as long as the Mesos executor is running as the BlockManager is
created on executor registration.

Let me know if you need any more info.

Tim



> ---------- Forwarded message ----------
> From: Josh Devins <jo...@soundcloud.com>
> Date: 22 December 2014 at 17:23
> Subject: Mesos resource allocation
> To: user@spark.apache.org
>
>
> We are experimenting with running Spark on Mesos after running
> successfully in Standalone mode for a few months. With the Standalone
> resource manager (as well as YARN), you have the option to define the
> number of cores, number of executors and memory per executor. In
> Mesos, however, it appears as though you cannot specify the number of
> executors, even in coarse-grained mode. If this is the case, how do
> you define the number of executors to run with?
>
> Here's an example of why this matters (to us). Let's say we have the
> following cluster:
>
> num nodes: 8
> num cores: 256 (32 per node)
> total memory: 512GB (64GB per node)
>
> If I set my job to require 256 cores and per-executor-memory to 30GB,
> then Mesos will schedule a single executor per machine (8 executors
> total) and each executor will get 32 cores to work with. This means
> that we have 8 executors * 32GB each for a total of 240G of cluster
> memory in use — less than half of what is available. If you want
> actually 16 executors in order to increase the amount of memory in use
> across the cluster, how can you do this with Mesos? It seems that a
> parameter is missing (or I haven't found it yet) which lets me tune
> this for Mesos:
>  * number of executors per n-cores OR
>  * number of executors total
>
> Furthermore, in fine-grained mode in Mesos, how are the executors
> started/allocated? That is, since Spark tasks map to Mesos tasks, when
> and how are executors started? If they are transient and an executor
> per task is created, does this mean we cannot have cached RDDs?
>
> Thanks for any advice or pointers,
>
> Josh
>

Fwd: Mesos resource allocation

Posted by Josh Devins <jo...@soundcloud.com>.

Cross-posting to see if someone with Mesos experience can help with
this Spark-Mesos question (below).

---------- Forwarded message ----------
From: Josh Devins <jo...@soundcloud.com>
Date: 22 December 2014 at 17:23
Subject: Mesos resource allocation
To: user@spark.apache.org

We are experimenting with running Spark on Mesos after running
successfully in Standalone mode for a few months. With the Standalone
resource manager (as well as YARN), you have the option to define the
number of cores, number of executors and memory per executor. In
Mesos, however, it appears as though you cannot specify the number of
executors, even in coarse-grained mode. If this is the case, how do
you define the number of executors to run with?

Here's an example of why this matters (to us). Let's say we have the
following cluster:

num nodes: 8
num cores: 256 (32 per node)
total memory: 512GB (64GB per node)

If I set my job to require 256 cores and per-executor-memory to 30GB,
then Mesos will schedule a single executor per machine (8 executors
total) and each executor will get 32 cores to work with. This means
that we have 8 executors * 32GB each for a total of 240G of cluster
memory in use — less than half of what is available. If you want
actually 16 executors in order to increase the amount of memory in use
across the cluster, how can you do this with Mesos? It seems that a
parameter is missing (or I haven't found it yet) which lets me tune
this for Mesos:
 * number of executors per n-cores OR
 * number of executors total

Furthermore, in fine-grained mode in Mesos, how are the executors
started/allocated? That is, since Spark tasks map to Mesos tasks, when
and how are executors started? If they are transient and an executor
per task is created, does this mean we cannot have cached RDDs?

Thanks for any advice or pointers,

Josh