Posted to dev@bigtop.apache.org by jay vyas <ja...@gmail.com> on 2015/02/20 02:45:54 UTC

Testing spark HDFS ... anything we should think about?

hi folks.

Is anyone planning to use Spark on YARN or Spark w/ HDFS in Bigtop? I
haven't tried either...

- Is anyone using Spark <-> HDFS in Bigtop? Do we need to update any Spark
configs to do so?
- Do we want Spark to run on YARN? Standalone?

I'm spinning some VMs up now, I'll let folks know if it works.
-- 
jay vyas

Re: Testing spark HDFS ... anything we should think about?

Posted by jay vyas <ja...@gmail.com>.

Spark <-> HDFS integration works great. Just confirmed it.

1) Build spark (gradlew spark-yum)
2) vagrant up (add spark to vagrant conf file)
3) hadoop fs -put /etc/passwd /tmp/passwd
4) val lines = sc.textFile("/tmp/passwd")

and then you can do lines.collect ... which prints it out.
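
For anyone repeating the check, here is a minimal sketch of the spark-shell
side. It assumes the file from step 3 ended up at /tmp/passwd in HDFS and
that the node has the Hadoop 2 client configs in place (hence the
"fs.defaultFS" property name); neither assumption is confirmed beyond what
is written above.

```scala
// Paste into spark-shell, which already provides the SparkContext as `sc`.

// Which filesystem do scheme-less paths resolve to? With the HDFS client
// configs picked up, this is expected to be an hdfs:// URI.
// ASSUMPTION: Hadoop 2 property name "fs.defaultFS".
val defaultFs = sc.hadoopConfiguration.get("fs.defaultFS")
println(s"default filesystem: $defaultFs")

// Read the file that was uploaded with `hadoop fs -put`.
// ASSUMPTION: it lives at /tmp/passwd in HDFS.
val lines = sc.textFile("/tmp/passwd")

// collect() pulls every line to the driver, which is fine for a tiny file.
lines.collect().foreach(println)

// A cheaper sanity check that also works for larger files.
println(s"line count: ${lines.count()}")
```

If the printed default filesystem is a file:// URI instead, the scheme-less
path is being read from the local disk and the test says nothing about HDFS.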



On Fri, Feb 20, 2015 at 7:36 PM, jay vyas <ja...@gmail.com> wrote:

> No prob, I'm verifying!



-- 
jay vyas

Re: Testing spark HDFS ... anything we should think about?

Posted by jay vyas <ja...@gmail.com>.

No prob, I'm verifying!

On Fri, Feb 20, 2015 at 7:17 PM, Konstantin Boudnik <co...@apache.org> wrote:

> On Fri, Feb 20, 2015 at 02:07PM, jay vyas wrote:
> > Yes, I think that's what he means, b/c when running on YARN, you read in
> > the conf from hadoop_conf, and you manually send jars like
> > spark-examples.jar (which would otherwise be available to workers if
> > you had Spark installed on all nodes).
> >
> > I'm okay w/ either (standalone, YARN, Mesos, whatever) Spark deployment,
> > but we should probably pick one :)
> >
> > For now, at a minimum, we want to make sure we are able to at least
> > leverage HDFS properly, even if we just run standalone Spark.
>
> I don't think it is an issue, really. Or at least it wasn't when we did
> Spark initially. Would be great if someone is willing to verify - I have
> no cycles for Spark anymore, honestly.
>
> Cos



-- 
jay vyas

Re: Testing spark HDFS ... anything we should think about?

Posted by Konstantin Boudnik <co...@apache.org>.

On Fri, Feb 20, 2015 at 02:07PM, jay vyas wrote:
> Yes, I think that's what he means, b/c when running on YARN, you read in the
> conf from hadoop_conf, and you manually send jars like spark-examples.jar
> (which would otherwise be available to workers if you had Spark installed
> on all nodes).
> 
> I'm okay w/ either (standalone, YARN, Mesos, whatever) Spark deployment,
> but we should probably pick one :)
> 
> For now, at a minimum, we want to make sure we are able to at least
> leverage HDFS properly, even if we just run standalone Spark.

I don't think it is an issue, really. Or at least it wasn't when we did Spark
initially. Would be great if someone is willing to verify - I have no cycles
for Spark anymore, honestly.

Cos


Re: Testing spark HDFS ... anything we should think about?

Posted by jay vyas <ja...@gmail.com>.

Yes, I think that's what he means, b/c when running on YARN, you read in the
conf from hadoop_conf, and you manually send jars like spark-examples.jar
(which would otherwise be available to workers if you had Spark installed
on all nodes).

I'm okay w/ either (standalone, YARN, Mesos, whatever) Spark deployment,
but we should probably pick one :)

For now, at a minimum, we want to make sure we are able to at least
leverage HDFS properly, even if we just run standalone Spark.
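
Whichever deployment gets picked, the application code for an HDFS check
can stay the same, because the master is normally chosen at submit time
(spark-submit's --master option) rather than hard-coded. A minimal sketch
under that assumption; the object name HdfsSmokeTest and the default path
/tmp/passwd are hypothetical and only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical smoke-test app: counts the lines of one file.
// No master is set in code, so the same jar can be submitted to a
// standalone master or to YARN without changes.
object HdfsSmokeTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("hdfs-smoke-test")
    val sc = new SparkContext(conf)
    try {
      // ASSUMPTION: default path; pass another path as the first argument.
      val path = if (args.nonEmpty) args(0) else "/tmp/passwd"
      val count = sc.textFile(path).count()
      println(s"$path has $count lines")
    } finally {
      // Always release cluster resources, even if the read fails.
      sc.stop()
    }
  }
}
```

A scheme-less path like the default above resolves against whatever default
filesystem the Hadoop configuration on the submitting node declares, so the
same run also exercises the "read in the conf" step described here.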




On Fri, Feb 20, 2015 at 1:58 PM, Konstantin Boudnik <co...@apache.org> wrote:

> On Fri, Feb 20, 2015 at 02:21PM, Evans Ye wrote:
> > I don't have Spark expertise, but here are some points I'm thinking about.
> > IIRC, Spark standalone does not support Kerberos. And the benefit of
> > deploying Spark on YARN should be that you don't need to maintain packages
> > on your own across a cluster of hundreds of nodes.
>
> Could you clarify what you mean by this? Do you mean that you won't need
> to install spark-worker on the cluster's nodes?
>
> Cos



-- 
jay vyas

Re: Testing spark HDFS ... anything we should think about?

Posted by Konstantin Boudnik <co...@apache.org>.

On Fri, Feb 20, 2015 at 02:21PM, Evans Ye wrote:
> I don't have Spark expertise, but here are some points I'm thinking about.
> IIRC, Spark standalone does not support Kerberos. And the benefit of
> deploying Spark on YARN should be that you don't need to maintain packages
> on your own across a cluster of hundreds of nodes.

Could you clarify what you mean by this? Do you mean that you won't need
to install spark-worker on the cluster's nodes?

Cos

> Not sure if there are downsides. Just want to add some points :)

Re: Testing spark HDFS ... anything we should think about?

Posted by Evans Ye <ev...@apache.org>.

I don't have Spark expertise, but here are some points I'm thinking about.
IIRC, Spark standalone does not support Kerberos. And the benefit of
deploying Spark on YARN should be that you don't need to maintain packages
on your own across a cluster of hundreds of nodes.
Not sure if there are downsides. Just want to add some points :)

2015-02-20 9:49 GMT+08:00 Konstantin Boudnik <co...@apache.org>:

> The way we're deploying Spark is in standalone mode - I've never seen any
> value in using YARN for that, but I guess it's just me.
>
> HDFS use comes with no hassle, AFAIR, the way we set it up. But my
> knowledge might be a bit outdated...
>
> Cos

Re: Testing spark HDFS ... anything we should think about?

Posted by Konstantin Boudnik <co...@apache.org>.

The way we're deploying Spark is in standalone mode - I've never seen any
value in using YARN for that, but I guess it's just me.

HDFS use comes with no hassle, AFAIR, the way we set it up. But my knowledge
might be a bit outdated...

Cos

On Thu, Feb 19, 2015 at 08:45PM, jay vyas wrote:
> hi folks.
> 
> Is anyone planning to use Spark on YARN or Spark w/ HDFS in Bigtop? I
> haven't tried either...
> 
> - Is anyone using Spark <-> HDFS in Bigtop? Do we need to update any Spark
> configs to do so?
> - Do we want Spark to run on YARN? Standalone?
> 
> I'm spinning some VMs up now, I'll let folks know if it works.
> -- 
> jay vyas