Posted to user@spark.apache.org by Sophia <sl...@163.com> on 2014/05/04 09:56:49 UTC

different in spark on yarn mode and standalone mode

Hey guys,
What is the difference between Spark on YARN mode and standalone mode in
terms of resource scheduling?
Wishing you a happy day.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/different-in-spark-on-yarn-mode-and-standalone-mode-tp5300.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: different in spark on yarn mode and standalone mode

Posted by Vipul Pandey <vi...@gmail.com>.
And I thought I sent it to the right list! Here you go again - question below:

On May 14, 2014, at 3:06 PM, Vipul Pandey <vi...@gmail.com> wrote:

> So here's a follow-up question: what's the preferred mode?
> We have a new cluster coming up with petabytes of data, and we intend to take Spark to production. We are trying to figure out which mode would be safe and stable for a production-like environment.
> Pros and cons? Anyone?
> 
> Any reasons why one would choose Standalone over YARN?
> 
> Thanks,
> Vipul







Re: different in spark on yarn mode and standalone mode

Posted by Sandy Ryza <sa...@cloudera.com>.
We made several stabilization changes to Spark on YARN that made it
into Spark 0.9.1 and CDH 5.0.  Spark 1.0 significantly simplifies submitting a
Spark app to a YARN cluster: wildly different invocations are no longer
needed for yarn-client and yarn-cluster mode.
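
That simplified 1.0-style invocation looks roughly like the sketch below; the class name, jar, and resource sizes are placeholders invented for illustration, not taken from this thread:

```shell
# Unified spark-submit script (Spark 1.0+); the driver runs inside the YARN cluster.
# com.example.MyApp and myapp.jar are placeholder names.
spark-submit \
  --master yarn-cluster \
  --class com.example.MyApp \
  --num-executors 4 \
  --executor-memory 2g \
  myapp.jar

# Same script for yarn-client mode, where the driver runs on the submitting machine.
spark-submit --master yarn-client --class com.example.MyApp myapp.jar
```

For context, the pre-1.0 yarn-cluster path went through a separate org.apache.spark.deploy.yarn.Client entry point, which is roughly the "wildly different invocation" being referred to.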

I'm not sure who is running it in production, but we have several
customers trying it out, some on pretty large clusters.

You can view the app information you mentioned in YARN as well: the YARN
ResourceManager UI has a link to each app's UI.  A limitation I forgot to
mention with YARN is that it is currently more difficult to view executor
logs of completed applications; you need to use the CLI (yarn logs
-applicationId <appid>), whereas the standalone master displays them in its
web UI.  We hope to address this soon through YARN's generic timeline store.
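
A sketch of that log-retrieval workaround; the application ID below is a made-up example, and YARN log aggregation must be enabled for `yarn logs` to find anything:

```shell
# List finished applications to find the ID you want.
yarn application -list -appStates FINISHED

# Fetch the aggregated container (executor) logs for one of them.
# Requires yarn.log-aggregation-enable=true on the cluster.
yarn logs -applicationId application_1400000000000_0001
```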

-Sandy


Re: different in spark on yarn mode and standalone mode

Posted by Vipul Pandey <vi...@gmail.com>.
Thanks for responding, Sandy. 

YARN is for sure a more mature way of working on shared resources. I was not sure how stable Spark on YARN is, or whether anyone is using it in production.
I have been using standalone mode on our dev cluster, but in terms of multi-tenancy and resource allocation it's difficult to call it production-ready yet. (I'm not sure whether 1.0 has significant changes, as I haven't kept up lately.)

What I take from your response is that for a production-like environment YARN will be the better choice, since in our case we don't care too much about saving a few seconds of startup time. Stability will definitely be a concern, but I'm assuming Spark on YARN is not terrible either and will mature over time, in which case we don't have to compromise on other important factors (like resource sharing and prioritization).

By the way, can I see information on which RDDs are cached, their sizes, etc. on YARN, like I can in the standalone-mode UI?


~Vipul


Re: different in spark on yarn mode and standalone mode

Posted by Sandy Ryza <sa...@cloudera.com>.
Hi Vipul,

Some advantages of using YARN:
* YARN allows you to dynamically share and centrally configure the same
pool of cluster resources between all frameworks that run on YARN.  You can
throw your entire cluster at a MapReduce job, then use some of it for an
Impala query and the rest for a Spark application, without any changes in
configuration.
* You can take advantage of all the features of YARN schedulers for
categorizing, isolating, and prioritizing workloads.
* YARN provides CPU isolation between processes via cgroups.  Spark
standalone mode requires each application to run an executor on every node
in the cluster; with YARN, you choose the number of executors to use.
* YARN is the only cluster manager for Spark that supports security and
Kerberized clusters.

Some advantages of using standalone:
* It has been around for longer, so it is likely a little more stable.
* Many report faster startup times for apps.

-Sandy
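
Two of the points above (explicit executor counts and scheduler queues) show up directly as spark-submit flags; this is a hedged sketch, and the queue name, resource sizes, class, and jar are invented for illustration:

```shell
# Ask YARN for exactly ten executors of a given size, in a specific
# scheduler queue; standalone mode at this point instead runs one
# executor per worker node for each application.
spark-submit \
  --master yarn-cluster \
  --queue production \
  --num-executors 10 \
  --executor-cores 2 \
  --executor-memory 4g \
  --class com.example.ETLJob \
  etl-job.jar
```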



Re: different in spark on yarn mode and standalone mode

Posted by Vipul Pandey <vi...@gmail.com>.
So here's a follow-up question: what's the preferred mode?
We have a new cluster coming up with petabytes of data, and we intend to take Spark to production. We are trying to figure out which mode would be safe and stable for a production-like environment.
Pros and cons? Anyone?

Any reasons why one would choose Standalone over YARN?

Thanks,
Vipul


RE: different in spark on yarn mode and standalone mode

Posted by "Liu, Raymond" <ra...@intel.com>.
At their core, they are not that different.
In standalone mode, the Spark master and workers allocate the driver and executors for your Spark app.
In YARN mode, the YARN ResourceManager and NodeManagers do this work.
Once the driver and executors have been launched, the rest of the resource scheduling goes through the same process, i.e., between driver and executors via Akka actors.

Best Regards,
Raymond Liu
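
As a concrete sketch of the split Raymond describes (the host name is a placeholder): standalone mode means launching Spark's own daemons first, while on YARN the ResourceManager and NodeManagers are already running, so there is nothing Spark-specific to start.

```shell
# Standalone mode: start Spark's own master and workers (scripts ship in sbin/),
# then point the application at the Spark master.
./sbin/start-master.sh
./sbin/start-slaves.sh          # starts a worker on every host listed in conf/slaves
./bin/spark-shell --master spark://master-host:7077

# YARN mode: no Spark daemons; the RM and NMs allocate the executor containers.
./bin/spark-shell --master yarn-client
```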

