Posted to user@spark.apache.org by polkosity <po...@gmail.com> on 2014/02/25 07:22:00 UTC

Job initialization performance of Spark standalone mode vs YARN

Is there any difference in the performance of Spark standalone mode and YARN
when it comes to initializing a new Spark job?  

In my application, response time is absolutely critical, and I'm hoping to
have the executors working within a few seconds of submitting the job.

Both options ran quickly for me (running the SparkPi example) on a single-node
cluster, with only a couple of seconds until executors began work.  On my
10-node cluster it takes YARN over 10 seconds before the executors actually
begin work.  Could I expect Spark standalone to get going any quicker?  If
so, I will take the time to configure it on the 10-node cluster.

Why does the example run so much quicker on my local single-node cluster
than on my 10 EC2 m1.larges?

Aside from YARN being able to schedule Spark, MRv2 and other job types, are
there any major differences between Spark standalone and YARN?

Thanks.
- Dan



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-of-Spark-standalone-mode-vs-YARN-tp2016.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Job initialization performance of Spark standalone mode vs YARN

Posted by Koert Kuipers <ko...@tresata.com>.

To be more precise, the difference depends on the deserialization overhead
from Kryo for your data structures.
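
As a rough illustration of where that overhead comes from, here is a small
Python sketch. It is not Spark or Tachyon code: pickle stands in for Kryo, and
the record layout, the `total_score` job and the cache variable names are all
made up for the example. It reads the same records once as live objects and
once from a serialized blob that has to be decoded first.

```python
import pickle
import time

# Hypothetical records; any structured data would do.
records = [{"id": i, "tags": ["a", "b", "c"], "score": i * 0.5}
           for i in range(50_000)]

# "Deserialized" cache: live objects, ready to use.
live_cache = records

# "Serialized" cache: one byte blob that must be decoded on every read.
# pickle is only a stand-in for Kryo here.
serialized_cache = pickle.dumps(records)


def total_score(rows):
    """Stand-in for a job that scans the cached data."""
    return sum(row["score"] for row in rows)


def timed(fn):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


live_total, live_secs = timed(lambda: total_score(live_cache))
ser_total, ser_secs = timed(
    lambda: total_score(pickle.loads(serialized_cache)))

print(f"live objects:     {live_secs:.4f}s")
print(f"serialized bytes: {ser_secs:.4f}s (includes deserialization)")
```

Both reads give the same answer; the second one simply pays for decoding
first. How large that cost is depends on how expensive your data structures
are to decode, which is the job-dependent part.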


On Mon, Mar 3, 2014 at 8:21 PM, Koert Kuipers <ko...@tresata.com> wrote:

> Yes, Tachyon keeps data in memory in serialized form, which is not as fast
> as data cached in memory in Spark (not serialized). The difference really
> depends on your job type.
>
> On Mon, Mar 3, 2014 at 7:10 PM, polkosity <po...@gmail.com> wrote:
>
>> That's exciting!  Will be looking into that, thanks Andrew.
>>
>> Related topic: has anyone had any experience running Spark on the Tachyon
>> in-memory filesystem, and could offer their views on using it?

Re: Job initialization performance of Spark standalone mode vs YARN

Posted by Koert Kuipers <ko...@tresata.com>.

Yes, Tachyon keeps data in memory in serialized form, which is not as fast
as data cached in memory in Spark (not serialized). The difference really
depends on your job type.

On Mon, Mar 3, 2014 at 7:10 PM, polkosity <po...@gmail.com> wrote:

> That's exciting!  Will be looking into that, thanks Andrew.
>
> Related topic: has anyone had any experience running Spark on the Tachyon
> in-memory filesystem, and could offer their views on using it?

Re: Job initialization performance of Spark standalone mode vs YARN

Posted by Mayur Rustagi <ma...@gmail.com>.

+1


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Mon, Mar 3, 2014 at 4:10 PM, polkosity <po...@gmail.com> wrote:

> That's exciting!  Will be looking into that, thanks Andrew.
>
> Related topic: has anyone had any experience running Spark on the Tachyon
> in-memory filesystem, and could offer their views on using it?

Re: Job initialization performance of Spark standalone mode vs YARN

Posted by polkosity <po...@gmail.com>.

That's exciting!  Will be looking into that, thanks Andrew.

Related topic: has anyone had any experience running Spark on the Tachyon
in-memory filesystem, and could offer their views on using it?

Re: Job initialization performance of Spark standalone mode vs YARN

Posted by Andrew Ash <an...@andrewash.com>.

polkosity, have you seen the job server that Ooyala open sourced?  I think
it's very similar to what you're proposing with a REST API and re-using a
SparkContext.

https://github.com/apache/incubator-spark/pull/222
http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server


On Mon, Mar 3, 2014 at 3:30 PM, polkosity <po...@gmail.com> wrote:

> We're thinking of creating a Spark job server with a REST API, which would
> let us manage jobs and also re-use the Spark context, as you suggest.
> Thanks Koert!

Re: Job initialization performance of Spark standalone mode vs YARN

Posted by polkosity <po...@gmail.com>.

We're thinking of creating a Spark job server with a REST API, which would
let us manage jobs and also re-use the Spark context, as you suggest.
Thanks Koert!

Re: Job initialization performance of Spark standalone mode vs YARN

Posted by Mayur Rustagi <ma...@gmail.com>.

Would you be the best person in the world and share some code? It's a pretty
common problem.

On Mar 6, 2014 6:36 PM, "polkosity" <po...@gmail.com> wrote:

> We're not using Ooyala's job server.  We are holding the Spark context for
> reuse within our own REST server (with a service to run each job).
>
> Our low-latency job now reads all its data from a memory-cached RDD instead
> of from an HDFS sequence file (upstream jobs cache their resultant RDDs for
> downstream jobs to read).

Re: Job initialization performance of Spark standalone mode vs YARN

Posted by polkosity <po...@gmail.com>.

We're not using Ooyala's job server.  We are holding the Spark context for
reuse within our own REST server (with a service to run each job).

Our low-latency job now reads all its data from a memory-cached RDD instead
of from an HDFS sequence file (upstream jobs cache their resultant RDDs for
downstream jobs to read).
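
A minimal sketch of that arrangement, in plain Python so it runs without a
cluster. `JobService`, `context_factory` and `loader` are hypothetical names,
not part of Spark or of the server described above; with PySpark the factory
would presumably build a `SparkContext`, and a loader would build an RDD and
call `.cache()` on it. The REST layer is left out, and only the reuse of one
context and of named cached datasets is shown.

```python
import threading


class JobService:
    """Holds one long-lived context and a registry of cached datasets.

    `context_factory` is a hypothetical hook that builds the expensive
    context (a stand-in for creating a SparkContext).
    """

    def __init__(self, context_factory):
        self._context_factory = context_factory
        self._context = None
        self._datasets = {}
        self._lock = threading.RLock()

    def context(self):
        # Build the context on first use, then hand back the same one.
        with self._lock:
            if self._context is None:
                self._context = self._context_factory()
            return self._context

    def dataset(self, name, loader):
        # The first job to ask for `name` pays for loading it; later jobs
        # asking for the same name get the cached object back.
        with self._lock:
            if name not in self._datasets:
                self._datasets[name] = loader(self.context())
            return self._datasets[name]

    def run(self, job):
        # A job is any callable that takes the service.
        return job(self)


# Demo with fake stand-ins so the reuse is visible without Spark.
created = []
loads = []


def fake_context_factory():
    created.append("ctx")  # record each context that gets built
    return {"name": "fake-context"}


def load_events(ctx):
    loads.append("events")  # record each real load
    return [1, 2, 3, 4]


service = JobService(fake_context_factory)
first = service.run(lambda s: sum(s.dataset("events", load_events)))
second = service.run(lambda s: max(s.dataset("events", load_events)))
print(first, second, len(created), len(loads))
```

Two jobs run here, but the context is built once and the "events" data is
loaded once; the second job only pays for its own computation.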




Re: Job initialization performance of Spark standalone mode vs YARN

Posted by Mayur Rustagi <ma...@gmail.com>.

Are you using the job server or just reusing the Spark context?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Wed, Mar 5, 2014 at 10:30 PM, polkosity <po...@gmail.com> wrote:

> After changing to reuse spark context and cache RDDs in memory, performance
> is 4 times better.  We didn't expect that much of an improvement!

Re: Job initialization performance of Spark standalone mode vs YARN

Posted by polkosity <po...@gmail.com>.

After changing to reuse spark context and cache RDDs in memory, performance
is 4 times better.  We didn't expect that much of an improvement!




Re: Job initialization performance of Spark standalone mode vs YARN

Posted by Koert Kuipers <ko...@tresata.com>.

If you need a quick response, re-use your Spark context between queries and
cache RDDs in memory.

On Mar 3, 2014 12:42 AM, "polkosity" <po...@gmail.com> wrote:

> Thanks for the advice Mayur.
>
> I thought I'd report back on the performance difference...  Spark standalone
> mode has executors processing at capacity in under a second :)

Re: Job initialization performance of Spark standalone mode vs YARN

Posted by Ron Gonzalez <zl...@yahoo.com>.

Hi,
  Can you explain a little more what's going on? Which one submits a job to the YARN cluster that creates an application master and spawns containers for the local jobs? I tried yarn-client and submitted to our YARN cluster, and it seems to work that way.  Shouldn't Client.scala be running within the AppMaster instance in this run mode?
  How exactly does yarn-standalone work?

Thanks,
Ron

> On Apr 3, 2014, at 11:19 AM, Kevin Markey <ke...@oracle.com> wrote:
>
> We are now testing precisely what you ask about in our environment.  But Sandy's questions are relevant.  The bigger issue is not Spark vs. YARN but "client" vs. "standalone" and where the client is located on the network relative to the cluster.
>
> The "client" options that locate the client/master remote from the cluster, while useful for interactive queries, suffer from considerable network traffic overhead as the master schedules and transfers data with the worker nodes on the cluster.  The "standalone" options locate the master/client on the cluster.  In yarn-standalone, the master runs as a thread inside the YARN ApplicationMaster, which itself runs in a container on the cluster.  That means much less traffic, as the master is co-located with the worker nodes on the cluster and its scheduling/data communication has less latency.
>
> In my comparisons between yarn-client and yarn-standalone (so as not to conflate YARN vs. Spark), yarn-client computation time is at least double yarn-standalone!  That holds at least for a job with lots of stages and lots of client/worker communication, although rather few "collect" actions, so it's mainly scheduling that's relevant here.
>
> I'll be posting more information as I have it available.
>
> Kevin
>
>
>> On 03/03/2014 03:48 PM, Sandy Ryza wrote:
>> Are you running in yarn-standalone mode or yarn-client mode?  Also, what YARN scheduler and what NodeManager heartbeat?
>>
>>
>> On Sun, Mar 2, 2014 at 9:41 PM, polkosity <po...@gmail.com> wrote:
>>> Thanks for the advice Mayur.
>>>
>>> I thought I'd report back on the performance difference...  Spark standalone
>>> mode has executors processing at capacity in under a second :)

Re: Job initialization performance of Spark standalone mode vs YARN

Posted by Sandy Ryza <sa...@cloudera.com>.

Are you running in yarn-standalone mode or yarn-client mode?  Also, what
YARN scheduler and what NodeManager heartbeat?


On Sun, Mar 2, 2014 at 9:41 PM, polkosity <po...@gmail.com> wrote:

> Thanks for the advice Mayur.
>
> I thought I'd report back on the performance difference...  Spark standalone
> mode has executors processing at capacity in under a second :)

Re: Job initialization performance of Spark standalone mode vs YARN

Posted by polkosity <po...@gmail.com>.

Thanks for the advice Mayur.

I thought I'd report back on the performance difference...  Spark standalone
mode has executors processing at capacity in under a second :)




Re: Job initialization performance of Spark standalone mode vs YARN

Posted by Mayur Rustagi <ma...@gmail.com>.

Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi



On Mon, Feb 24, 2014 at 10:22 PM, polkosity <po...@gmail.com> wrote:

> Is there any difference in the performance of Spark standalone mode and YARN
> when it comes to initializing a new Spark job?
>
Yes. YARN is a much more complex cluster manager than the one provided by
Spark standalone.

>
> In my application, response time is absolutely critical, and I'm hoping to
> have the executors working within a few seconds of submitting the job.
>
> Both options ran quickly for me (running the SparkPi example) on a single-node
> cluster, with only a couple of seconds until executors began work.  On my
> 10-node cluster it takes YARN over 10 seconds before the executors actually
> begin work.  Could I expect Spark standalone to get going any quicker?  If
> so, I will take the time to configure it on the 10-node cluster.
>
Yes, Spark standalone is much faster and can be preferred if you are not
running any other applications (like Hive, HBase, etc.) on the cluster. I
get a very responsive 2-3 second response time in standalone mode with 10
machines.

>
> Why does the example run so much quicker on my local single node cluster
> than on my 10 EC2 m1.larges?


> Aside from YARN being able to schedule Spark, MRv2 and other job types, are
> there any major differences between Spark standalone and YARN?
>
YARN has much more granular control over the cluster resources. You can
also look into Mesos for cluster management, which will be much faster than
YARN for now.

>
> Thanks.
> - Dan