Posted to common-user@hadoop.apache.org by Steve Sapovits <ss...@invitemedia.com> on 2008/03/01 18:10:29 UTC

Amazon S3 questions

I have some confusion over the use of Amazon S3 as storage.

I was looking at fs.default.name as the name node -- a host and a port
the client uses to ask the name node to perform DFS services.  But for Amazon
S3 you give it an S3 bucket URL, which is really just a direct pointer to the 
storage.  So it seems fs.default.name is really just a storage setting that happens
to be a service (host/port) for HDFS.  Even though S3 is also a service, it can't
also be a name node.

If that's true, where is the host/port of the name node configured separately from
fs.default.name?  It would seem that no matter what the underlying storage is, you'd
need a name node service somewhere.

I'm probably missing something bigger here ...

-- 
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
ssapovits@invitemedia.com


Re: Amazon S3 questions

Posted by Steve Sapovits <ss...@invitemedia.com>.
Bradford Stephens wrote:

> What sort of performance hit is there for using S3 vs.  a local cluster?

It probably only makes sense speed-wise if you're running on EC2.  S3 access
from EC2 is a lot faster than accessing it from outside the Amazon cloud.

If you run on EC2, S3 is essentially your persistent storage.
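For what it's worth, the hadoop-site.xml for that kind of setup is roughly the
following (the bucket name and credential values are placeholders; the two key
properties are the standard ones the S3 filesystem reads):

<property>
  <name>fs.default.name</name>
  <value>s3://my-hadoop-bucket</value>
</property>
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>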

-- 
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
ssapovits@invitemedia.com

Re: Amazon S3 questions

Posted by Aaron Kimball <ak...@cs.washington.edu>.
Tests of S3-to-EC2 transfer speed using curl have shown that gigabit
Ethernet speeds can be reached between nodes, but only when multiple
transfers are overlapped. See:
http://info.rightscale.com/2007/11/29/network-performance-in-ec2-and-s3
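You can reproduce the effect without curl, too. Here's a rough sketch in plain
Java (not Hadoop's S3 code; pass it a few URLs of publicly readable objects)
that streams each URL on its own thread, so the transfers overlap:

import java.io.InputStream;
import java.net.URL;

public class OverlappedFetch {
    public static void main(String[] args) throws Exception {
        Thread[] workers = new Thread[args.length];
        for (int i = 0; i < args.length; i++) {
            final String url = args[i];
            // One thread per object: a single transfer won't saturate
            // the link, but several overlapped ones can get close.
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    try {
                        InputStream in = new URL(url).openStream();
                        byte[] buf = new byte[65536];
                        long total = 0;
                        int n;
                        while ((n = in.read(buf)) != -1) {
                            total += n;
                        }
                        in.close();
                        System.out.println(url + ": " + total + " bytes");
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
            workers[i].start();
        }
        for (int i = 0; i < args.length; i++) {
            workers[i].join();
        }
    }
}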

- Aaron

Toby DiPasquale wrote:
> On Sat, Mar 1, 2008 at 4:19 PM, Bradford Stephens
> <br...@gmail.com> wrote:
>> What sort of performance hit is there for using S3 vs.  a local cluster?
> 
> In my experience on EC2 using Hadoop, S3-backed HDFS is 25-33% slower
> than disk-backed HDFS. That was using 0.14.x, however, so it might be
> better now...
> 

Re: Amazon S3 questions

Posted by Toby DiPasquale <co...@gmail.com>.
On Sat, Mar 1, 2008 at 4:19 PM, Bradford Stephens
<br...@gmail.com> wrote:
> What sort of performance hit is there for using S3 vs.  a local cluster?

In my experience on EC2 using Hadoop, S3-backed HDFS is 25-33% slower
than disk-backed HDFS. That was using 0.14.x, however, so it might be
better now...

-- 
Toby DiPasquale

Re: Amazon S3 questions

Posted by Bradford Stephens <br...@gmail.com>.
What sort of performance hit is there for using S3 vs.  a local cluster?

On Sat, Mar 1, 2008 at 1:09 PM, Steve Sapovits
<ss...@invitemedia.com> wrote:
>
>  One other note: When you use S3 URIs, you get a "port out of range" error
>  on startup but that doesn't appear to be fatal.  I spent a few hours on that
>  one before I realized it didn't seem to matter.  It seems like the S3 URI format
>  where ':' is used to separate ID and secret key is confusing someone.
>
>
>
>  --
>  Steve Sapovits
>  Invite Media  -  http://www.invitemedia.com
>  ssapovits@invitemedia.com
>
>

Re: Amazon S3 questions

Posted by Steve Sapovits <ss...@invitemedia.com>.
>> do you have an underscore in your bucket name?
> 
> Yes I do.

Sorry, I was wrong -- no underscores.  Some of our buckets do, but this particular one uses only dashes.

-- 
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
ssapovits@invitemedia.com


Re: Amazon S3 questions

Posted by Steve Sapovits <ss...@invitemedia.com>.
Chris K Wensel wrote:

> do you have an underscore in your bucket name?

Yes I do.

Here's a sample error message/stack trace.  This is version 0.16.0:

localhost: Exception in thread "main" java.lang.IllegalArgumentException: port out of range:-1
localhost:      at java.net.InetSocketAddress.<init>(InetSocketAddress.java:118)
localhost:      at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:125)
localhost:      at org.apache.hadoop.dfs.SecondaryNameNode.<init>(SecondaryNameNode.java:94)
localhost:      at org.apache.hadoop.dfs.SecondaryNameNode.main(SecondaryNameNode.java:492)

-- 
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
ssapovits@invitemedia.com


Re: Amazon S3 questions

Posted by Chris K Wensel <ch...@wensel.net>.
do you have an underscore in your bucket name?

On Mar 2, 2008, at 9:24 AM, Tom White wrote:

>> One other note: When you use S3 URIs, you get a "port out of range" error
>> on startup but that doesn't appear to be fatal.  I spent a few hours on that
>> one before I realized it didn't seem to matter.  It seems like the S3 URI format
>> where ':' is used to separate ID and secret key is confusing someone.
>
> Do you have a stacktrace for this? Sounds like something we could
> improve, if only by printing a warning message.
>
> Tom

Chris K Wensel
chris@wensel.net
http://chris.wensel.net/

Re: Amazon S3 questions

Posted by Tom White <to...@gmail.com>.
>  One other note: When you use S3 URIs, you get a "port out of range" error
>  on startup but that doesn't appear to be fatal.  I spent a few hours on that
>  one before I realized it didn't seem to matter.  It seems like the S3 URI format
>  where ':' is used to separate ID and secret key is confusing someone.

Do you have a stacktrace for this? Sounds like something we could
improve, if only by printing a warning message.

Tom

Re: Amazon S3 questions

Posted by Steve Sapovits <ss...@invitemedia.com>.
One other note: When you use S3 URIs, you get a "port out of range" error
on startup but that doesn't appear to be fatal.  I spent a few hours on that 
one before I realized it didn't seem to matter.  It seems like the S3 URI format
where ':' is used to separate ID and secret key is confusing someone.
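For anyone hitting the same thing: the S3 URI form that embeds credentials is

s3://ACCESS_KEY_ID:SECRET_ACCESS_KEY@mybucket

and my guess is the generic host:port parsing sees that authority and comes up
with -1 for the port.  Setting fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey
in hadoop-site.xml instead, and using a plain s3://mybucket URI, keeps the ':'
out of the URI entirely.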

-- 
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
ssapovits@invitemedia.com


Re: Amazon S3 questions

Posted by Steve Sapovits <ss...@invitemedia.com>.
Tom White wrote:

> From the client's point of view fs.default.name sets the default
> filesystem, and is used to resolve paths that don't specify a
> protocol. You can always use a fully qualified URI to specify the path
> e.g. s3://bucket/a/b or hdfs://nn/a/b. This allows you to, for example,
> take map inputs from HDFS and write reduce outputs to S3.
> 
> For HDFS the setting of fs.default.name in hadoop-site.xml determines
> the host and port for the namenode.
> 
> Does this help? How are you trying to use S3 by the way?

Yup - I got that far.  It looks like with S3 there is no real name node or data
node cluster -- S3's own distribution is used instead (more or less directly).  That's
where my question came from.  I like that, if that's the case.  Does that make sense?

We will probably be running at least one version of a log writer/map-reducer
on EC2/S3.  Basically, large volumes of data related to a specific type of 
problem that we map-reduce for analysis.  We've been playing with Pig on
top of map-reduce as well.  Good stuff.

The only gotcha I see:  We wrote (extended, really) a SWIG wrapper on top
of the C libhdfs library so we could interface with Python.  It looks like the libhdfs
connect logic isn't handling the URI schemes 100% correctly -- I doubt S3 will
work through there.  But that looks like an easy fix if that's the case (I think).
Testing that next ...

-- 
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
ssapovits@invitemedia.com


Re: Amazon S3 questions

Posted by Tom White <to...@gmail.com>.
From the client's point of view fs.default.name sets the default
filesystem, and is used to resolve paths that don't specify a
protocol. You can always use a fully qualified URI to specify the path
e.g. s3://bucket/a/b or hdfs://nn/a/b. This allows you to, for example,
take map inputs from HDFS and write reduce outputs to S3.
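Concretely, that would look something like this with the 0.16-era mapred API
(the host, port, and paths here are invented, and if memory serves,
setInputPath/setOutputPath on JobConf were still the way to do it in 0.16):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class HdfsInS3Out {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(HdfsInS3Out.class);
        conf.setJobName("hdfs-in-s3-out");
        // Fully qualified URIs override fs.default.name per path:
        // map input comes from HDFS, reduce output goes to S3.
        conf.setInputPath(new Path("hdfs://namenode:9000/input/logs"));
        conf.setOutputPath(new Path("s3://my-bucket/job-output"));
        // Identity mapper/reducer by default; set your own classes here.
        JobClient.runJob(conf);
    }
}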

For HDFS the setting of fs.default.name in hadoop-site.xml determines
the host and port for the namenode.
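In other words, something like this (host and port invented):

<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode.example.com:9000</value>
</property>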

Does this help? How are you trying to use S3 by the way?

Tom

On 01/03/2008, Steve Sapovits <ss...@invitemedia.com> wrote:
>
> I have some confusion over the use of Amazon S3 as storage.
>
> I was looking at fs.default.name as the name node -- a host and a port
> the client uses to ask the name node to perform DFS services.  But for Amazon
> S3 you give it an S3 bucket URL, which is really just a direct pointer to the
> storage.  So it seems fs.default.name is really just a storage setting that happens
> to be a service (host/port) for HDFS.  Even though S3 is also a service, it can't
> also be a name node.
>
> If that's true, where is the host/port of the name node configured separately from
> fs.default.name?  It would seem that no matter what the underlying storage is, you'd
> need a name node service somewhere.
>
> I'm probably missing something bigger here ...
>
> --
> Steve Sapovits
> Invite Media  -  http://www.invitemedia.com
> ssapovits@invitemedia.com
>
>