Posted to user@spark.apache.org by captainfranz <ca...@gmail.com> on 2016/05/02 18:21:36 UTC

Spark standalone workers, executors and JVMs

I am still a little bit confused about workers, executors and JVMs in
standalone mode.
Are worker processes and executors independent JVMs or do executors run
within the worker JVM?
I have some memory-rich nodes (192GB) and I would like to avoid deploying
massive JVMs due to well known performance issues (GC and such).
As of Spark 1.4 it is possible to either deploy multiple workers
(SPARK_WORKER_INSTANCES + SPARK_WORKER_CORES) or multiple executors per
worker (--executor-cores). Which option is preferable and why?
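
For concreteness, a rough sketch of the two mechanisms in standalone mode
(all figures, file names and the master URL below are placeholders, not
recommendations):

    # Option A: several worker JVMs per node (conf/spark-env.sh on each node)
    export SPARK_WORKER_INSTANCES=6   # number of worker processes per node
    export SPARK_WORKER_CORES=4       # cores each worker may hand out
    export SPARK_WORKER_MEMORY=32g    # memory each worker may hand out

    # Option B: one worker per node, several executors per application,
    # sized at submit time
    spark-submit --master spark://<master-host>:7077 \
      --executor-cores 4 \
      --executor-memory 32g \
      --total-executor-cores 24 \
      my-app.jar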

Thanks,
Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini





Re: Spark standalone workers, executors and JVMs

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi,

Adding more cores without getting the memory-per-core ratio right can result
in more queuing and hence more contention, as was evident from the results
published earlier.

I had a bit of a discussion with one of the Spark experts, who claimed that
one should have a single executor per server and get parallelism through the
number of cores, but I am still not convinced. I would still go for multiple
containers on the master/primary. The crucial factor here is memory per core.
As I understand it from tests, the rule of thumb is to add more CPUs/cores
only once utilization of the existing cores is above about 80%. Unless one is
seeing those numbers after optimizing parallelism, adding more and more cores
is redundant, so the ratio of memory per core becomes very relevant. Of
course, if one could switch to faster CPUs and keep all other factors the
same, I would expect to see better performance immediately, but again that
comes at a cost. If the entire box is busy, adding enough cores to keep YARN
in its own little world, without having to fight other OS processes for
cores, should help. Having said that, in my opinion more CPUs/cores is *not*
better unless you have serious CPU contention in the first place. Faster is
better.

Back to your points: if you run 6 workers with 32GB each and assume 24 cores,
each worker is allocated 24 cores, and a memory-per-core ratio of 32/24
(about 1.3GB per core) does not sound that great. An alternative approach
would be to reduce the number of workers to 4, giving a better ratio of 48/24
(2GB per core).
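
A minimal spark-env.sh sketch of that alternative, assuming the same 192GB,
24-core node (the figures are illustrative, not a recommendation):

    # conf/spark-env.sh on each worker node
    export SPARK_WORKER_INSTANCES=4   # four worker JVMs per node
    export SPARK_WORKER_MEMORY=48g    # 48GB each
    # memory per core: 48/24 = 2GB, versus 32/24 = ~1.3GB with six workers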

Your mileage will vary depending on what the application will be doing. I
would just test it to get the best fit.


HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 4 May 2016 at 15:39, Simone Franzini <ca...@gmail.com> wrote:

> Hi Mohammed,
>
> Thanks for your reply. I agree with you; however, a single application can
> use multiple executors as well, so it is still not clear to me which option
> is best. Let me give an example to make this a little more concrete.
>
> Let's say I am only running a single application. Let's assume again that
> I have 192GB of memory and 24 cores on each node. Which one of the
> following two options is best and why:
> 1. Running 6 workers with 32GB each and 1 executor/worker (i.e. set
> SPARK_WORKER_INSTANCES=6, leave spark.executor.cores to its default, which
> is to assign all available cores to an executor in standalone mode).
> 2. Running 1 worker with 192GB memory and 6 executors/worker (i.e.
> SPARK_WORKER_INSTANCES=1 and spark.executor.cores=5,
> spark.executor.memory=32GB).
>
> Also, one more question. I understand that workers and executors are
> different processes. How many resources does the worker process actually
> use, and how do I set them? As far as I understand, the worker does not
> need many resources, as it only spawns executors. Is that correct?
>
> Thanks,
> Simone Franzini, PhD
>
> http://www.linkedin.com/in/simonefranzini
>
> On Mon, May 2, 2016 at 7:47 PM, Mohammed Guller <mo...@glassbeam.com>
> wrote:
>
>> The workers and executors run as separate JVM processes in standalone
>> mode.
>>
>>
>>
>> The use of multiple workers on a single machine depends on how you will
>> be using the cluster. If you run multiple Spark applications
>> simultaneously, each application gets its own executors. So, for example,
>> if you allocate 8GB to each application, you can run 192/8 = 24 Spark
>> applications simultaneously (assuming you also have a large number of
>> cores). Each executor has only an 8GB heap, so GC should not be an issue.
>> Alternatively, if you know that you will have only a few applications
>> running simultaneously on that cluster, running multiple workers on each
>> machine will allow you to avoid the GC issues associated with allocating
>> a large heap to a single JVM process. This option allows you to run
>> multiple executors for an application on a single machine, and each
>> executor can be configured with optimal memory.
>>
>>
>>
>>
>>
>> Mohammed
>>
>> Author: Big Data Analytics with Spark
>> <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>
>>
>>
>>
>> *From:* Simone Franzini [mailto:captainfranz@gmail.com]
>> *Sent:* Monday, May 2, 2016 9:27 AM
>> *To:* user
>> *Subject:* Fwd: Spark standalone workers, executors and JVMs
>>
>>
>>
>> I am still a little bit confused about workers, executors and JVMs in
>> standalone mode.
>>
>> Are worker processes and executors independent JVMs or do executors run
>> within the worker JVM?
>>
>> I have some memory-rich nodes (192GB) and I would like to avoid deploying
>> massive JVMs due to well known performance issues (GC and such).
>>
>> As of Spark 1.4 it is possible to either deploy multiple workers
>> (SPARK_WORKER_INSTANCES + SPARK_WORKER_CORES) or multiple executors per
>> worker (--executor-cores). Which option is preferable and why?
>>
>>
>>
>> Thanks,
>>
>> Simone Franzini, PhD
>>
>> http://www.linkedin.com/in/simonefranzini
>>
>>
>>
>
>

RE: Spark standalone workers, executors and JVMs

Posted by Mohammed Guller <mo...@glassbeam.com>.
Spark allows you to configure the resources for the worker process. If I remember correctly, you can use SPARK_DAEMON_MEMORY to control the memory allocated to the worker process.
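
For example, something like the following in spark-env.sh (the 1g figure is
only an illustration; the worker daemon itself needs little memory, since it
mainly launches and monitors executors):

    # conf/spark-env.sh
    export SPARK_DAEMON_MEMORY=1g   # heap for the standalone master/worker daemons themselves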

Option #1 below is more appropriate if you will be running just one application at a time. A 32GB heap size is still too high, though; depending on the garbage collector, you may see long pauses.

Mohammed
Author: Big Data Analytics with Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Simone Franzini [mailto:captainfranz@gmail.com]
Sent: Wednesday, May 4, 2016 7:40 AM
To: user
Subject: Re: Spark standalone workers, executors and JVMs

Hi Mohammed,

Thanks for your reply. I agree with you; however, a single application can use multiple executors as well, so it is still not clear to me which option is best. Let me give an example to make this a little more concrete.

Let's say I am only running a single application. Let's assume again that I have 192GB of memory and 24 cores on each node. Which one of the following two options is best and why:
1. Running 6 workers with 32GB each and 1 executor/worker (i.e. set SPARK_WORKER_INSTANCES=6, leave spark.executor.cores to its default, which is to assign all available cores to an executor in standalone mode).
2. Running 1 worker with 192GB memory and 6 executors/worker (i.e. SPARK_WORKER_INSTANCES=1 and spark.executor.cores=5, spark.executor.memory=32GB).

Also, one more question. I understand that workers and executors are different processes. How many resources does the worker process actually use, and how do I set them? As far as I understand, the worker does not need many resources, as it only spawns executors. Is that correct?

Thanks,
Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini

On Mon, May 2, 2016 at 7:47 PM, Mohammed Guller <mo...@glassbeam.com> wrote:
The workers and executors run as separate JVM processes in standalone mode.

The use of multiple workers on a single machine depends on how you will be using the cluster. If you run multiple Spark applications simultaneously, each application gets its own executors. So, for example, if you allocate 8GB to each application, you can run 192/8 = 24 Spark applications simultaneously (assuming you also have a large number of cores). Each executor has only an 8GB heap, so GC should not be an issue. Alternatively, if you know that you will have only a few applications running simultaneously on that cluster, running multiple workers on each machine will allow you to avoid the GC issues associated with allocating a large heap to a single JVM process. This option allows you to run multiple executors for an application on a single machine, and each executor can be configured with optimal memory.


Mohammed
Author: Big Data Analytics with Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Simone Franzini [mailto:captainfranz@gmail.com]
Sent: Monday, May 2, 2016 9:27 AM
To: user
Subject: Fwd: Spark standalone workers, executors and JVMs

I am still a little bit confused about workers, executors and JVMs in standalone mode.
Are worker processes and executors independent JVMs or do executors run within the worker JVM?
I have some memory-rich nodes (192GB) and I would like to avoid deploying massive JVMs due to well known performance issues (GC and such).
As of Spark 1.4 it is possible to either deploy multiple workers (SPARK_WORKER_INSTANCES + SPARK_WORKER_CORES) or multiple executors per worker (--executor-cores). Which option is preferable and why?

Thanks,
Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini



Re: Does Spark have a CapacityScheduler?

Posted by Ted Yu <yu...@gmail.com>.
Cycling old bits:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-scheduling-with-Capacity-scheduler-td10038.html
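
Within an application Spark only offers FIFO and FAIR scheduling pools; a
rough sketch of wiring in the pools file (the file path and pool names are
illustrative):

    # spark-defaults.conf (or pass via --conf at submit time)
    spark.scheduler.mode             FAIR
    spark.scheduler.allocation.file  /path/to/fairscheduler.xml

Jobs can then be placed into a pool from inside the application with
sc.setLocalProperty("spark.scheduler.pool", "production"). As far as I know
there is no drop-in CapacityScheduler equivalent in Spark itself.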

On Wed, May 4, 2016 at 7:44 AM, 开心延年 <mu...@qq.com> wrote:

> Scheduling Within an Application
>
> I found the FAIR scheduler, but is there some example implementation like
> YARN's CapacityScheduler?
>
>
> <?xml version="1.0"?><allocations>
>   <pool name="production">
>     <schedulingMode>FAIR</schedulingMode>
>     <weight>1</weight>
>     <minShare>2</minShare>
>   </pool>
>   <pool name="test">
>     <schedulingMode>FIFO</schedulingMode>
>     <weight>2</weight>
>     <minShare>3</minShare>
>   </pool></allocations>
>
>

Does Spark have a CapacityScheduler?

Posted by 开心延年 <mu...@qq.com>.
Scheduling Within an Application

I found the FAIR scheduler, but is there some example implementation like YARN's CapacityScheduler?



<?xml version="1.0"?>
<allocations>
  <pool name="production">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="test">
    <schedulingMode>FIFO</schedulingMode>
    <weight>2</weight>
    <minShare>3</minShare>
  </pool>
</allocations>

Re: Spark standalone workers, executors and JVMs

Posted by Simone Franzini <ca...@gmail.com>.
Hi Mohammed,

Thanks for your reply. I agree with you; however, a single application can
use multiple executors as well, so it is still not clear to me which option
is best. Let me give an example to make this a little more concrete.

Let's say I am only running a single application. Let's assume again that I
have 192GB of memory and 24 cores on each node. Which one of the following
two options is best and why:
1. Running 6 workers with 32GB each and 1 executor/worker (i.e. set
SPARK_WORKER_INSTANCES=6, leave spark.executor.cores to its default, which
is to assign all available cores to an executor in standalone mode).
2. Running 1 worker with 192GB memory and 6 executors/worker (i.e.
SPARK_WORKER_INSTANCES=1 and spark.executor.cores=5,
spark.executor.memory=32GB).
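
In configuration terms, I imagine option 2 would look roughly like this
(using 4 cores per executor so that 24/4 = 6 executors fit on the node; the
master URL and application name are placeholders):

    # conf/spark-env.sh: one large worker per node
    export SPARK_WORKER_INSTANCES=1
    export SPARK_WORKER_MEMORY=192g

    # per application, at submit time
    spark-submit --master spark://<master-host>:7077 \
      --executor-cores 4 \
      --executor-memory 32g \
      my-app.jar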

Also, one more question. I understand that workers and executors are
different processes. How many resources does the worker process actually
use, and how do I set them? As far as I understand, the worker does not
need many resources, as it only spawns executors. Is that correct?

Thanks,
Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini

On Mon, May 2, 2016 at 7:47 PM, Mohammed Guller <mo...@glassbeam.com>
wrote:

> The workers and executors run as separate JVM processes in standalone
> mode.
>
>
>
> The use of multiple workers on a single machine depends on how you will be
> using the cluster. If you run multiple Spark applications simultaneously,
> each application gets its own executors. So, for example, if you allocate
> 8GB to each application, you can run 192/8 = 24 Spark applications
> simultaneously (assuming you also have a large number of cores). Each
> executor has only an 8GB heap, so GC should not be an issue. Alternatively,
> if you know that you will have only a few applications running
> simultaneously on that cluster, running multiple workers on each machine
> will allow you to avoid the GC issues associated with allocating a large
> heap to a single JVM process. This option allows you to run multiple
> executors for an application on a single machine, and each executor can be
> configured with optimal memory.
>
>
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>
>
>
>
> *From:* Simone Franzini [mailto:captainfranz@gmail.com]
> *Sent:* Monday, May 2, 2016 9:27 AM
> *To:* user
> *Subject:* Fwd: Spark standalone workers, executors and JVMs
>
>
>
> I am still a little bit confused about workers, executors and JVMs in
> standalone mode.
>
> Are worker processes and executors independent JVMs or do executors run
> within the worker JVM?
>
> I have some memory-rich nodes (192GB) and I would like to avoid deploying
> massive JVMs due to well known performance issues (GC and such).
>
> As of Spark 1.4 it is possible to either deploy multiple workers
> (SPARK_WORKER_INSTANCES + SPARK_WORKER_CORES) or multiple executors per
> worker (--executor-cores). Which option is preferable and why?
>
>
>
> Thanks,
>
> Simone Franzini, PhD
>
> http://www.linkedin.com/in/simonefranzini
>
>
>

RE: Spark standalone workers, executors and JVMs

Posted by Mohammed Guller <mo...@glassbeam.com>.
The workers and executors run as separate JVM processes in standalone mode.

The use of multiple workers on a single machine depends on how you will be using the cluster. If you run multiple Spark applications simultaneously, each application gets its own executors. So, for example, if you allocate 8GB to each application, you can run 192/8 = 24 Spark applications simultaneously (assuming you also have a large number of cores). Each executor has only an 8GB heap, so GC should not be an issue. Alternatively, if you know that you will have only a few applications running simultaneously on that cluster, running multiple workers on each machine will allow you to avoid the GC issues associated with allocating a large heap to a single JVM process. This option allows you to run multiple executors for an application on a single machine, and each executor can be configured with optimal memory.
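
As a rough illustration of the first scenario (the master URL and application
name are placeholders):

    # each application asks for modest 8GB executors
    spark-submit --master spark://<master-host>:7077 \
      --executor-memory 8g \
      app-one.jar
    # with 192GB per node, roughly 192/8 = 24 such applications can each hold
    # an 8GB executor on the node at the same time (cores permitting)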


Mohammed
Author: Big Data Analytics with Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Simone Franzini [mailto:captainfranz@gmail.com]
Sent: Monday, May 2, 2016 9:27 AM
To: user
Subject: Fwd: Spark standalone workers, executors and JVMs

I am still a little bit confused about workers, executors and JVMs in standalone mode.
Are worker processes and executors independent JVMs or do executors run within the worker JVM?
I have some memory-rich nodes (192GB) and I would like to avoid deploying massive JVMs due to well known performance issues (GC and such).
As of Spark 1.4 it is possible to either deploy multiple workers (SPARK_WORKER_INSTANCES + SPARK_WORKER_CORES) or multiple executors per worker (--executor-cores). Which option is preferable and why?

Thanks,
Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini


Fwd: Spark standalone workers, executors and JVMs

Posted by Simone Franzini <ca...@gmail.com>.
I am still a little bit confused about workers, executors and JVMs in
standalone mode.
Are worker processes and executors independent JVMs or do executors run
within the worker JVM?
I have some memory-rich nodes (192GB) and I would like to avoid deploying
massive JVMs due to well known performance issues (GC and such).
As of Spark 1.4 it is possible to either deploy multiple workers
(SPARK_WORKER_INSTANCES + SPARK_WORKER_CORES) or multiple executors per
worker (--executor-cores). Which option is preferable and why?

Thanks,
Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini