Posted to user@spark.apache.org by Aureliano Buendia <bu...@gmail.com> on 2014/01/22 23:40:09 UTC

Using persistent HDFS on Spark EC2 instances

Hi,

1. It seems by default spark ec2 uses ephemeral hdfs, how to switch this to
persistent hdfs?

2. By default persistent hdfs server is not up, is this meant to be like
this?

Unless I'm missing something, the docs
(https://spark.incubator.apache.org/docs/0.8.1/ec2-scripts.html) do not
cover these points.

Re: Using persistent HDFS on Spark EC2 instances

Posted by Patrick Wendell <pw...@gmail.com>.
> 1. It seems by default spark ec2 uses ephemeral hdfs, how to switch this to
> persistent hdfs?
You can stop the ephemeral one using

/root/ephemeral-hdfs/bin/stop-dfs.sh

and start the persistent one using

 /root/persistent-hdfs/bin/start-dfs.sh

> 2. By default persistent hdfs server is not up, is this meant to be like
> this?

Yes - it starts only an ephemeral one:

"The spark-ec2 script already sets up a HDFS instance for you. It’s
installed in /root/ephemeral-hdfs"

Re: Using persistent HDFS on Spark EC2 instances

Posted by Aureliano Buendia <bu...@gmail.com>.
On Thu, Jan 23, 2014 at 12:41 AM, Patrick Wendell <pw...@gmail.com> wrote:

> It should work correctly and yes, it starts and stops on port 9010.
> You'll need to use "hdfs://<master-hostname>:9010/path/to/whatever" to
> access files from Spark. Is that what you are asking about?
>

Actually, when I tried:

myRdd.saveAsTextFile("hdfs://<master-hostname>:9000/path/to/whatever")

It threw an error, since Spark already adds the
"hdfs://<master-hostname>:9000/" prefix to the path.

So I use:

myRdd.saveAsTextFile("/path/to/whatever")

and it ends up in the ephemeral HDFS. That's why I asked whether Spark needs
extra configuration to work with the persistent HDFS.

Of course, as you mentioned, the other way is to change the persistent HDFS
port to 9000.
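
For reference, a minimal sketch (Scala) of the explicit-URI approach, assuming
the persistent NameNode stays on its default port 9010 and that sc is the
spark-shell's SparkContext; <master-hostname> and the output path are
placeholders:

// Write directly to the persistent HDFS by spelling out the NameNode URI,
// bypassing the default (ephemeral, port 9000) filesystem.
val data = sc.parallelize(1 to 100)
data.saveAsTextFile("hdfs://<master-hostname>:9010/path/to/whatever")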


>
> On Wed, Jan 22, 2014 at 4:36 PM, Aureliano Buendia <bu...@gmail.com>
> wrote:
> > The persistent-hdfs server is set to port 9010 instead of 9000. Does Spark
> > need more config for this?
> >
> >
> > On Thu, Jan 23, 2014 at 12:26 AM, Patrick Wendell <pw...@gmail.com>
> > wrote:
> >>
> >> > 1. It seems by default spark ec2 uses ephemeral hdfs, how to switch this
> >> > to persistent hdfs?
> >> You can stop the ephemeral one using
> >>
> >> /root/ephemeral-hdfs/bin/stop-dfs.sh
> >>
> >> and start the persistent one using
> >>
> >>  /root/persistent-hdfs/bin/start-dfs.sh
> >>
> >> > 2. By default persistent hdfs server is not up, is this meant to be like
> >> > this?
> >>
> >> Yes - it starts only an ephemeral one:
> >>
> >> "The spark-ec2 script already sets up a HDFS instance for you. It’s
> >> installed in /root/ephemeral-hdfs"
> >
> >
>

Re: Using persistent HDFS on Spark EC2 instances

Posted by Patrick Wendell <pw...@gmail.com>.
You can change the behavior by editing core-site.xml in Spark's conf
directory to make 9010 the default filesystem. This is something the
docs could probably be improved to mention; if you have interest in
submitting a PR, I'd be happy to review it.

If you look, the default filesystem is set to port 9000:
https://github.com/mesos/spark-ec2/blob/v2/templates/root/spark/conf/core-site.xml
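
For example, the change in Spark's conf/core-site.xml would look roughly like
this (a sketch, assuming the template's Hadoop 1.x fs.default.name key;
<master-hostname> is a placeholder for whatever the template substitutes for
the master):

<property>
  <!-- Point the default filesystem at the persistent HDFS NameNode (port 9010)
       instead of the ephemeral one (port 9000). -->
  <name>fs.default.name</name>
  <value>hdfs://<master-hostname>:9010</value>
</property>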

On Wed, Jan 22, 2014 at 4:41 PM, Patrick Wendell <pw...@gmail.com> wrote:
> It should work correctly and yes, it starts and stops on port 9010.
> You'll need to use "hdfs://<master-hostname>:9010/path/to/whatever" to
> access files from Spark. Is that what you are asking about?
>
> On Wed, Jan 22, 2014 at 4:36 PM, Aureliano Buendia <bu...@gmail.com> wrote:
>> The persistent-hdfs server is set to port 9010 instead of 9000. Does Spark
>> need more config for this?
>>
>>
>> On Thu, Jan 23, 2014 at 12:26 AM, Patrick Wendell <pw...@gmail.com>
>> wrote:
>>>
>>> > 1. It seems by default spark ec2 uses ephemeral hdfs, how to switch this
>>> > to persistent hdfs?
>>> You can stop the ephemeral one using
>>>
>>> /root/ephemeral-hdfs/bin/stop-dfs.sh
>>>
>>> and start the persistent one using
>>>
>>>  /root/persistent-hdfs/bin/start-dfs.sh
>>>
>>> > 2. By default persistent hdfs server is not up, is this meant to be like
>>> > this?
>>>
>>> Yes - it starts only an ephemeral one:
>>>
>>> "The spark-ec2 script already sets up a HDFS instance for you. It’s
>>> installed in /root/ephemeral-hdfs"
>>
>>

Re: Using persistent HDFS on Spark EC2 instances

Posted by Patrick Wendell <pw...@gmail.com>.
It should work correctly and yes, it starts and stops on port 9010.
You'll need to use "hdfs://<master-hostname>:9010/path/to/whatever" to
access files from Spark. Is that what you are asking about?

On Wed, Jan 22, 2014 at 4:36 PM, Aureliano Buendia <bu...@gmail.com> wrote:
> The persistent-hdfs server is set to port 9010 instead of 9000. Does Spark
> need more config for this?
>
>
> On Thu, Jan 23, 2014 at 12:26 AM, Patrick Wendell <pw...@gmail.com>
> wrote:
>>
>> > 1. It seems by default spark ec2 uses ephemeral hdfs, how to switch this
>> > to persistent hdfs?
>> You can stop the ephemeral one using
>>
>> /root/ephemeral-hdfs/bin/stop-dfs.sh
>>
>> and start the persistent one using
>>
>>  /root/persistent-hdfs/bin/start-dfs.sh
>>
>> > 2. By default persistent hdfs server is not up, is this meant to be like
>> > this?
>>
>> Yes - it starts only an ephemeral one:
>>
>> "The spark-ec2 script already sets up a HDFS instance for you. It’s
>> installed in /root/ephemeral-hdfs"
>
>

Re: Using persistent HDFS on Spark EC2 instances

Posted by Aureliano Buendia <bu...@gmail.com>.
The persistent-hdfs server is set to port 9010 instead of 9000. Does Spark
need more config for this?


On Thu, Jan 23, 2014 at 12:26 AM, Patrick Wendell <pw...@gmail.com> wrote:

> > 1. It seems by default spark ec2 uses ephemeral hdfs, how to switch this
> > to persistent hdfs?
> You can stop the ephemeral one using
>
> /root/ephemeral-hdfs/bin/stop-dfs.sh
>
> and start the persistent one using
>
>  /root/persistent-hdfs/bin/start-dfs.sh
>
> > 2. By default persistent hdfs server is not up, is this meant to be like
> > this?
>
> Yes - it starts only an ephemeral one:
>
> "The spark-ec2 script already sets up a HDFS instance for you. It’s
> installed in /root/ephemeral-hdfs"
>
