Posted to user@spark.apache.org by ishaaq <is...@gmail.com> on 2014/04/15 14:28:03 UTC

standalone vs YARN

Hi all,
I am evaluating Spark for use here at my work.

We have an existing Hadoop 1.x install, which I am planning to upgrade to
Hadoop 2.3.

I am trying to work out whether I should install YARN or simply set up a
Spark standalone cluster. We already use ZooKeeper, so setting up HA isn't a
problem. I am puzzled, however, as to how the Spark nodes coordinate on data
locality - i.e., assuming I install the Spark nodes on the same machines as
the DFS data nodes, how does Spark work out which nodes should get which
splits of a job?
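
(For context, by HA I mean the standalone master's ZooKeeper recovery mode;
as I read the docs, it comes down to a couple of daemon-side properties along
these lines, with placeholder ensemble addresses:)

    # spark-env.sh on each standalone master node (sketch; hosts are placeholders)
    export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181"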

Anyway, my bigger question remains: YARN or standalone? Which is the more
stable option currently? Which is the more future-proof option?

Thanks,
Ishaaq 




Re: standalone vs YARN

Posted by Surendranauth Hiraman <su...@velos.io>.
Prashant,

In another email thread several weeks ago, it was mentioned that YARN
support is considered beta until Spark 1.0. Is that not the case?

-Suren



On Tue, Apr 15, 2014 at 8:38 AM, Prashant Sharma <sc...@gmail.com> wrote:

> [snip - full reply quoted below in this thread]


-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io

Re: standalone vs YARN

Posted by Prashant Sharma <sc...@gmail.com>.
Hi Ishaaq,

Answers inline, from what I know; I'd be glad to be corrected though.

On Tue, Apr 15, 2014 at 5:58 PM, ishaaq <is...@gmail.com> wrote:

> Hi all,
> I am evaluating Spark for use here at my work.
>
> We have an existing Hadoop 1.x install, which I am planning to upgrade to
> Hadoop 2.3.
>
This is not really a requirement for Spark; if you are doing it for some
other reason, great!


> I am trying to work out whether I should install YARN or simply set up a
> Spark standalone cluster. We already use ZooKeeper, so setting up HA isn't a
> problem. I am puzzled, however, as to how the Spark nodes coordinate on data
> locality - i.e., assuming I install the Spark nodes on the same machines as
> the DFS data nodes, how does Spark work out which nodes should get which
> splits of a job?
>
This happens exactly the same way Hadoop MapReduce figures out data
locality: Spark supports Hadoop's InputFormats, which carry the information
on how the data is partitioned and where each split lives. So having Spark
workers share the same nodes as your DFS data nodes is a good idea.
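
For example, a minimal sketch (the master URL and HDFS path are placeholders)
showing that partitions created from an HDFS file already carry their
preferred hosts:

    import org.apache.spark.{SparkConf, SparkContext}

    object LocalityPeek {
      def main(args: Array[String]): Unit = {
        // Placeholder master URL; point it at your own cluster.
        val sc = new SparkContext(
          new SparkConf().setAppName("locality-peek").setMaster("spark://master-host:7077"))

        // textFile goes through Hadoop's TextInputFormat, so each HDFS block
        // becomes a partition that records which data nodes hold its replicas.
        val lines = sc.textFile("hdfs://namenode:8020/data/events.log") // placeholder path

        // The scheduler consults these hosts when assigning tasks, preferring
        // workers running on the same machines as the data nodes.
        for (p <- lines.partitions)
          println("partition " + p.index + " prefers: " +
            lines.preferredLocations(p).mkString(", "))

        sc.stop()
      }
    }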


> Anyway, my bigger question remains: YARN or standalone? Which is the more
> stable option currently? Which is the more future-proof option?
>
>
I think standalone is stable enough for all purposes, and Spark's YARN
support has been keeping up with the latest Hadoop versions too. If you are
already using YARN and don't want the hassle of setting up another cluster
manager, you will probably prefer YARN.
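
Either way, the application code stays the same; only the master setting
(and the cluster-side setup) differs. A rough sketch, assuming the
yarn-client master string available in current releases:

    import org.apache.spark.{SparkConf, SparkContext}

    object DeployChoice {
      def main(args: Array[String]): Unit = {
        // Pick one master; the job code below is identical either way.
        val master = "spark://master-host:7077" // standalone (placeholder host)
        // val master = "yarn-client"           // YARN: driver here, executors in containers

        val sc = new SparkContext(new SparkConf().setAppName("deploy-choice").setMaster(master))

        // Trivial job to smoke-test whichever cluster manager is in use.
        println(sc.parallelize(1 to 100).sum())

        sc.stop()
      }
    }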


> Thanks,
> Ishaaq