Posted to user@spark.apache.org by Jacek Laskowski <ja...@japila.pl> on 2016/06/16 11:37:44 UTC

[YARN] Questions about YARN's queues and Spark's FAIR scheduler

Hi,

I'm trying to get my head around the different parts of Spark on YARN
architecture with YARN's schedulers and queues as well as Spark's own
schedulers - FAIR and FIFO.

I'd appreciate it if you could read how I see things and correct me where
I'm wrong. Thanks!

The default scheduler in YARN is the Capacity Scheduler [1]. It comes with
the notion of queues. When you spark-submit a Spark application with
--master yarn, you can specify --queue to pick the scheduling queue, and its
**only** purpose is to offer the right share of CPUs and memory to the
application. There could be more resources in the cluster, but that
particular queue gets only its configured share of vcores and memory.

In other words, Spark does not know about any other resources but the
ones available in the queue.

Is this correct?

You can also spark-submit a Spark application with the FAIR scheduler
(the default is FIFO) by using -c spark.scheduler.mode=FAIR.

In FAIR mode, there's also a notion of queue-like (Schedulable) pools.
They can also control the resource shares assigned to Spark
jobs/applications. You can use sc.setLocalProperty to control which pool
to use.

Is this correct?
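
To show concretely which two knobs I am asking about, here is how I picture
it (just my own sketch; the queue and pool names are invented and nothing
here is meant as the recommended way):

import org.apache.spark.{SparkConf, SparkContext}

// Submitted with --master yarn; the YARN queue and the Spark scheduling mode
// could equally be given on the command line (--queue analytics,
// -c spark.scheduler.mode=FAIR).
val conf = new SparkConf()
  .setAppName("queues-vs-pools")
  .set("spark.yarn.queue", "analytics")   // YARN level: which Capacity Scheduler queue
  .set("spark.scheduler.mode", "FAIR")    // Spark level: FAIR instead of FIFO
val sc = new SparkContext(conf)

// Spark level again: pick a Schedulable pool for the Jobs run from this thread.
sc.setLocalProperty("spark.scheduler.pool", "reports")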

If both are yes, why would I want to go as far as using queues and
FAIR scheduling mode with pools? What are the benefits? Is this for
multi-tenant environments? Do you have any use cases that would fit
better with FAIR scheduling mode? What about YARN's queues with Spark
on YARN?

Share as much as you can, since the topic bothers me so much (and
without your support I won't be able to recover from this painful
mental state :))

Thanks for reading so far! Appreciate any help.

[1] https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html

Pozdrawiam (Regards),
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: [YARN] Questions about YARN's queues and Spark's FAIR scheduler

Posted by Mark Hamstra <ma...@clearstorydata.com>.
>
> You can also spark-submit a Spark application with the FAIR scheduler
> (the default is FIFO) by using -c spark.scheduler.mode=FAIR.
>
> In FAIR mode, there's also a notion of queue-like (Schedulable) pools.
> They can also control the resource shares assigned to Spark
> jobs/applications. You can use sc.setLocalProperty to control which pool
> to use.
>
> Is this correct?


No, this is incorrect.  Spark's fair scheduler is only about controlling
the concurrent scheduling of Jobs within an Application, not about handling
the concurrency of Applications.  You'd use Spark's fair scheduler when you
have a Spark Application that runs many Jobs and you don't just want those
Jobs queuing up in FIFO order but instead want to see interleaving of the
execution of the Tasks for multiple Jobs, according to some fair-ordering
policy.  Just setting spark.scheduler.mode=FAIR isn't enough.  You also
have to configure the scheduling pools that implement your desired policy,
and you need to assign Jobs to those pools before running them -- else the
Jobs all just end up in the default pool, which may be enough in some
simple cases if you have at least configured the default pool.  Spark's
fair scheduler also currently has limited ability to control resource
allocation.  Essentially, all of the cluster resources are available to
every pool, so whichever cluster resource offer is made next can be used by
the Task that has been fair scheduled to the head of the scheduling queue,
regardless of whatever resources that Task's Job is already using.  This
may change in the not-too-distant future (SPARK-15176), but not before
Spark 2.1.  An even larger change would be to allow pre-emption of already
running Tasks if a higher-priority Task becomes runnable under certain
conditions, but I'm not aware of any current efforts to implement that.
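
To make the pool wiring concrete, here is a minimal sketch (the pool names,
weights and jobs are made up, and conf/fairscheduler.xml is just the
conventional location; you can point spark.scheduler.allocation.file
somewhere else):

import org.apache.spark.{SparkConf, SparkContext}

// conf/fairscheduler.xml (illustrative):
//   <allocations>
//     <pool name="production">
//       <schedulingMode>FAIR</schedulingMode>
//       <weight>2</weight>
//       <minShare>2</minShare>
//     </pool>
//     <pool name="adhoc">
//       <schedulingMode>FIFO</schedulingMode>
//       <weight>1</weight>
//     </pool>
//   </allocations>

val sc = new SparkContext(
  new SparkConf().setAppName("fair-pools").set("spark.scheduler.mode", "FAIR"))

// Jobs submitted from this thread go to the "production" pool...
sc.setLocalProperty("spark.scheduler.pool", "production")
sc.parallelize(1 to 1000000).map(_ * 2).count()

// ...and clearing the property sends subsequent Jobs back to the default pool.
sc.setLocalProperty("spark.scheduler.pool", null)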

Within these constraints, Spark's fair scheduling can and does work very
well in some significant use cases and production environments --
regardless of Mich's skepticism. :)


Re: [YARN] Questions about YARN's queues and Spark's FAIR scheduler

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Jacek,

Your point

" it could use FIFO or FAIR task scheduling. My question is when would I
need to use FAIR?... "

Good point, and this is my two cents on it.

FAIR scheduling (in the realm of YARN as the resource manager) is a method of
assigning resources to Spark jobs such that all jobs get, on average, *an
equal share of resources over time*. When there is a single job running
within a YARN cluster, that job uses the entire cluster.



Now when other Spark jobs are submitted, task slots that free up are
assigned to the new jobs, so that each job gets roughly the same amount of
core time. I think in FAIR mode YARN queues the jobs up and shares the slots
among them (that is what makes it fair). This lets short jobs finish in a
reasonable time while not starving long jobs. It is also a reasonable way to
share a cluster between a number of users. Finally, FAIR sharing can also
work with job priorities: the priorities are used as weights to determine the
fraction of total compute time that each job should get, so a job with weight
2 would be entitled to roughly twice the compute time of a job with weight 1.
I have never tried this myself.


HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com




Re: [YARN] Questions about YARN's queues and Spark's FAIR scheduler

Posted by Jacek Laskowski <ja...@japila.pl>.
Hi,

Thanks for your prompt answer.

You said "the resource scheduling is handled to YARN", so it's only about
vcores and memory, right? Once Spark has the resources (be it through a custom
queue in YARN's Capacity Scheduler or the default one), it can use FIFO or FAIR
task scheduling. My question is: when would I need to use FAIR? Is this
about TaskSetManagers (which represent Stages), so that more stages can be
computed "in parallel"? Why would I ever need to go for FAIR?
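
To make the question concrete, this is the kind of pattern I have in mind
(purely a made-up sketch, e.g. typed into spark-shell where sc already
exists): two Jobs fired from separate threads of the same application. With
FIFO the second Job only gets whatever slots the first one leaves free; with
FAIR their TaskSets would be interleaved.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Two independent actions, i.e. two concurrent Jobs in one application.
val longJob  = Future { sc.parallelize(1 to 100000000).map(_ + 1).count() }
val shortJob = Future { sc.parallelize(1 to 1000).count() }

Await.result(shortJob, Duration.Inf)  // with FAIR this should not starve behind longJob
Await.result(longJob, Duration.Inf)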



Pozdrawiam (Regards),
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


Re: [YARN] Questions about YARN's queues and Spark's FAIR scheduler

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi,

If YARN is chosen as the Spark resource scheduler then the resource
scheduling is handed over to YARN. In YARN, the ResourceManager is a resource
scheduler. It optimizes for cluster resource utilization, trying to keep all
resources in use all the time. It takes on the responsibility of negotiating a
specified container in which to start the ApplicationMaster and then
launches the ApplicationMaster. A Container represents a collection of
physical resources such as allocated memory (RAM) and CPU cores.

So back to your point: in YARN mode, i.e. --master yarn, if there are
resources available then YARN will kick off another container. You can see
that in the ResourceManager and NodeManager logs.
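
For example (just an illustrative sketch, names and numbers picked at
random), the executor settings you pass are what YARN turns into container
requests:

import org.apache.spark.{SparkConf, SparkContext}

// Submitted with --master yarn. Each executor then lands in a YARN container
// of roughly the requested memory plus overhead, with the requested vcores.
val conf = new SparkConf()
  .setAppName("yarn-containers")
  .set("spark.executor.instances", "2")
  .set("spark.executor.memory", "2g")
  .set("spark.executor.cores", "2")
val sc = new SparkContext(conf)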

You also mentioned

You can also spark-submit a Spark application with the FAIR scheduler
(the default is FIFO) by using -c spark.scheduler.mode=FAIR.

In FAIR mode, there's also a notion of queue-like (Schedulable) pools.
They can also control the resource shares assigned to Spark
jobs/applications. You can use sc.setLocalProperty to control which pool
to use.

The notion of pools is nothing new; most threaded architectures use
pools. However, I am not sure how many users or resource managers go ahead and
create pools. In real life I don't think many people bother. I think I am
looking at this from a practical point of view as opposed to what the Scala
code is saying, which I believe you are alluding to. I guess -c is short
for --conf. Yes, you can do that via the following:

${SPARK_HOME}/bin/spark-submit \
                 ................... \
                --conf "spark.scheduler.mode=FAIR" \

You can even run in local mode with the FAIR scheduler. Setting these
parameters does not affect how the actual job runs if the resources are not
available; in other words the default FIFO behaviour will apply. I am not
convinced of the validity of some of these parameters in real life.

Like most things, the proof of the pudding is in the eating; these
theoretical points have to be established through experiment. For example,
in local mode the default scheduler is FIFO, which seems reasonable.
However, I can instruct Spark to run with the FAIR scheduler even though it
has no real bearing there:

${SPARK_HOME}/bin/spark-submit \
                --packages com.databricks:spark-csv_2.11:1.3.0 \
                --driver-memory 2G \
                --num-executors 1 \
                --executor-memory 2G \
                --master local \
                --executor-cores 2 \
                --conf "spark.scheduler.mode=FAIR" \
                --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
                --jars /home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar \
                --class "${FILE_NAME}" \
                --conf "spark.ui.port=${SP}" \
                --conf "spark.driver.port=54631" \
                --conf "spark.fileserver.port=54731" \
                --conf "spark.blockManager.port=54832" \
                --conf "spark.kryoserializer.buffer.max=512" \
                ${JAR_FILE}



and you can see that in the GUI under the Environment tab.
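
You can also check it from the shell itself (a quick sanity check, assuming
the property was set at submit time):

// In spark-shell: print which scheduling mode the context picked up
// (falls back to FIFO if nothing was set).
println(sc.getConf.get("spark.scheduler.mode", "FIFO"))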

In local mode I can submit as many spark-submit jobs as I wish. The
constraint would be the resources within the host box. One JVM runs
independently of another.



[image: Inline images 1]


HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


