Posted to user@spark.apache.org by polkosity <po...@gmail.com> on 2014/03/03 06:41:52 UTC

Re: Job initialization performance of Spark standalone mode vs YARN

Thanks for the advice Mayur.

I thought I'd report back on the performance difference...  Spark standalone
mode has executors processing at capacity in under a second :)



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-of-Spark-standalone-mode-vs-YARN-tp2016p2243.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Job initialization performance of Spark standalone mode vs YARN

Posted by Koert Kuipers <ko...@tresata.com>.

to be more precise, the difference depends on de-serialization overhead
from kryo for your data structures.
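
A rough illustration of that overhead, as a sketch only: Python's pickle
stands in for Kryo here, and SerializedStore and ObjectStore are hypothetical
names, not Spark or Tachyon APIs. The serialized store pays one
deserialization per read, while the object store hands back the live object.

```python
import pickle


class SerializedStore:
    """Keeps each value as bytes; every read pays a deserialization."""

    def __init__(self):
        self._blobs = {}
        self.deserializations = 0

    def put(self, key, value):
        self._blobs[key] = pickle.dumps(value)

    def get(self, key):
        self.deserializations += 1
        return pickle.loads(self._blobs[key])


class ObjectStore:
    """Keeps live objects; a read returns the same object, with no decoding."""

    def __init__(self):
        self._objects = {}

    def put(self, key, value):
        self._objects[key] = value

    def get(self, key):
        return self._objects[key]


# Hypothetical records; richer data structures cost more to decode.
records = [{"id": i, "tags": ["a", "b"]} for i in range(1000)]

serialized = SerializedStore()
live = ObjectStore()
serialized.put("records", records)
live.put("records", records)

# Three "queries" against each store.
for _ in range(3):
    serialized.get("records")
    live.get("records")
```

How much this matters in a real job would depend on how often the data is
re-read and how expensive the structures are to decode.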


On Mon, Mar 3, 2014 at 8:21 PM, Koert Kuipers <ko...@tresata.com> wrote:

> Yes, Tachyon keeps data in memory in serialized form, which is not as fast
> as caching deserialized objects in memory in Spark. The difference really
> depends on your job type.
>
>
>
> On Mon, Mar 3, 2014 at 7:10 PM, polkosity <po...@gmail.com> wrote:
>
>> That's exciting!  Will be looking into that, thanks Andrew.
>>
>> Related topic, has anyone had any experience running Spark on Tachyon
>> in-memory filesystem, and could offer their views on using it?

Re: Job initialization performance of Spark standalone mode vs YARN

Posted by Koert Kuipers <ko...@tresata.com>.

Yes, Tachyon keeps data in memory in serialized form, which is not as fast
as caching deserialized objects in memory in Spark. The difference really
depends on your job type.

On Mon, Mar 3, 2014 at 7:10 PM, polkosity <po...@gmail.com> wrote:

> That's exciting!  Will be looking into that, thanks Andrew.
>
> Related topic, has anyone had any experience running Spark on Tachyon
> in-memory filesystem, and could offer their views on using it?

Re: Job initialization performance of Spark standalone mode vs YARN

Posted by Mayur Rustagi <ma...@gmail.com>.

+1


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Mon, Mar 3, 2014 at 4:10 PM, polkosity <po...@gmail.com> wrote:

> That's exciting!  Will be looking into that, thanks Andrew.
>
> Related topic, has anyone had any experience running Spark on Tachyon
> in-memory filesystem, and could offer their views on using it?

Re: Job initialization performance of Spark standalone mode vs YARN

Posted by polkosity <po...@gmail.com>.

That's exciting!  Will be looking into that, thanks Andrew.

Related topic, has anyone had any experience running Spark on Tachyon
in-memory filesystem, and could offer their views on using it? 



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-of-Spark-standalone-mode-vs-YARN-tp2016p2265.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Job initialization performance of Spark standalone mode vs YARN

Posted by Andrew Ash <an...@andrewash.com>.

polkosity, have you seen the job server that Ooyala open sourced?  I think
it's very similar to what you're proposing with a REST API and re-using a
SparkContext.

https://github.com/apache/incubator-spark/pull/222
http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server


On Mon, Mar 3, 2014 at 3:30 PM, polkosity <po...@gmail.com> wrote:

> We're thinking of creating a Spark job server with a REST API, which would
> let us manage jobs and also re-use the Spark context, as you suggest.
> Thanks Koert!

Re: Job initialization performance of Spark standalone mode vs YARN

Posted by polkosity <po...@gmail.com>.

We're thinking of creating a Spark job server with a REST API, which would
let us manage jobs and also re-use the Spark context, as you suggest.
Thanks Koert!



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-of-Spark-standalone-mode-vs-YARN-tp2016p2263.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Job initialization performance of Spark standalone mode vs YARN

Posted by Mayur Rustagi <ma...@gmail.com>.

Would you be the best person in the world and share some code? It's a pretty
common problem.
On Mar 6, 2014 6:36 PM, "polkosity" <po...@gmail.com> wrote:

> We're not using Ooyala's job server.  We are holding the spark context for
> reuse within our own REST server (with a service to run each job).
>
> Our low-latency job now reads all its data from a memory-cached RDD, instead
> of from an HDFS sequence file (upstream jobs cache their resulting RDDs for
> downstream jobs to read).

Re: Job initialization performance of Spark standalone mode vs YARN

Posted by polkosity <po...@gmail.com>.

We're not using Ooyala's job server.  We are holding the spark context for
reuse within our own REST server (with a service to run each job).

Our low-latency job now reads all its data from a memory-cached RDD, instead
of from an HDFS sequence file (upstream jobs cache their resulting RDDs for
downstream jobs to read).
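
The pattern described here might be sketched roughly as follows. This is not
the poster's code: FakeContext is a hypothetical stand-in for a SparkContext,
JobService is an invented name, and the dictionary of named results only
imitates RDDs cached by upstream jobs. No Spark API is used.

```python
import itertools


class FakeContext:
    """Stand-in for an expensive-to-create SparkContext (hypothetical)."""

    _ids = itertools.count(1)

    def __init__(self):
        # Each construction gets a new id, so reuse is easy to observe.
        self.context_id = next(FakeContext._ids)


class JobService:
    """Holds one long-lived context and a registry of named cached results."""

    def __init__(self):
        self._context = None
        self._cached = {}

    def context(self):
        # Create the context on first use, then hand back the same one.
        if self._context is None:
            self._context = FakeContext()
        return self._context

    def run(self, name, job, inputs=()):
        # Resolve inputs from results cached by upstream jobs.
        args = [self._cached[key] for key in inputs]
        result = job(self.context(), *args)
        # Cache the result under its name for downstream jobs.
        self._cached[name] = result
        return result


service = JobService()

# Upstream job: produces a dataset and leaves it cached in memory.
service.run("events", lambda ctx: list(range(10)))

# Downstream low-latency job: reads the cached dataset, not storage.
total = service.run("total", lambda ctx, events: sum(events), inputs=["events"])
```

A real REST server would also need to guard the shared context and the cache
against concurrent requests, which this sketch leaves out.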



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-of-Spark-standalone-mode-vs-YARN-tp2016p2384.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Job initialization performance of Spark standalone mode vs YARN

Posted by Mayur Rustagi <ma...@gmail.com>.

Are you using the job server or just reusing the Spark context?

On Wed, Mar 5, 2014 at 10:30 PM, polkosity <po...@gmail.com> wrote:

> After changing to reuse spark context and cache RDDs in memory, performance
> is 4 times better.  We didn't expect that much of an improvement!

Re: Job initialization performance of Spark standalone mode vs YARN

Posted by polkosity <po...@gmail.com>.

After changing to reuse spark context and cache RDDs in memory, performance
is 4 times better.  We didn't expect that much of an improvement!



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-of-Spark-standalone-mode-vs-YARN-tp2016p2340.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Job initialization performance of Spark standalone mode vs YARN

Posted by Koert Kuipers <ko...@tresata.com>.

If you need a quick response, re-use your SparkContext between queries and
cache RDDs in memory.
On Mar 3, 2014 12:42 AM, "polkosity" <po...@gmail.com> wrote:

> Thanks for the advice Mayur.
>
> I thought I'd report back on the performance difference...  Spark standalone
> mode has executors processing at capacity in under a second :)

Re: Job initialization performance of Spark standalone mode vs YARN

Posted by Ron Gonzalez <zl...@yahoo.com>.

Hi,
  Can you explain a little more what's going on? Which one submits a job to the yarn cluster that creates an application master and spawns containers for the local jobs? I tried yarn-client and submitted to our yarn cluster and it seems to work that way.  Shouldn't Client.scala be running within the AppMaster instance in this run mode?
  How exactly does yarn-standalone work?

Thanks,
Ron

Sent from my iPhone

> On Apr 3, 2014, at 11:19 AM, Kevin Markey <ke...@oracle.com> wrote:
> 
> We are now testing precisely what you ask about in our environment.  But Sandy's questions are relevant.  The bigger issue is not Spark vs. Yarn but "client" vs. "standalone" and where the client is located on the network relative to the cluster.
> 
> The "client" options that locate the client/master remote from the cluster, while useful for interactive queries, suffer from considerable network traffic overhead as the master schedules and transfers data with the worker nodes on the cluster.  The "standalone" options locate the master/client on the cluster.  In yarn-standalone, the master runs as a thread inside the YARN ApplicationMaster, on the cluster.  Far less traffic, as the master is co-located with the worker nodes on the cluster and its scheduling/data communication has less latency.
> 
> In my comparisons between yarn-client and yarn-standalone (so as not to conflate yarn vs Spark), yarn-client computation time is at least double yarn-standalone!  At least for a job with lots of stages and lots of client/worker communication, although rather few "collect" actions, so it's mainly scheduling that's relevant here.
> 
> I'll be posting more information as I have it available.
> 
> Kevin
> 
> 
>> On 03/03/2014 03:48 PM, Sandy Ryza wrote:
>> Are you running in yarn-standalone mode or yarn-client mode?  Also, what YARN scheduler and what NodeManager heartbeat?  
>> 
>> 
>> On Sun, Mar 2, 2014 at 9:41 PM, polkosity <po...@gmail.com> wrote:
>>> Thanks for the advice Mayur.
>>> 
>>> I thought I'd report back on the performance difference...  Spark standalone
>>> mode has executors processing at capacity in under a second :)

Re: Job initialization performance of Spark standalone mode vs YARN

Posted by Sandy Ryza <sa...@cloudera.com>.

Are you running in yarn-standalone mode or yarn-client mode?  Also, what
YARN scheduler and what NodeManager heartbeat?


On Sun, Mar 2, 2014 at 9:41 PM, polkosity <po...@gmail.com> wrote:

> Thanks for the advice Mayur.
>
> I thought I'd report back on the performance difference...  Spark standalone
> mode has executors processing at capacity in under a second :)