Posted to common-user@hadoop.apache.org by Steve Sapovits <ss...@invitemedia.com> on 2008/03/01 18:10:29 UTC
Amazon S3 questions
I have some confusion over the use of Amazon S3 as storage.
I was looking at the fs.default.name as the name node -- a host and a port
the client uses to ask the name node to perform DFS services. But for Amazon
S3 you give it an S3 bucket URL, which is really just a direct pointer to the
storage. So it seems fs.default.name is really just a storage setting that happens
to be a service (host/port) for HDFS. Even though S3 is also a service, it can't
also be a name node.
If that's true, where is the host/port of the name node configured separate from
fs.default.name? It would seem no matter what the underlying storage you'd need
a name node service somewhere.
I'm probably missing something bigger here ...
--
Steve Sapovits
Invite Media - http://www.invitemedia.com
ssapovits@invitemedia.com
Re: Amazon S3 questions
Posted by Steve Sapovits <ss...@invitemedia.com>.
Bradford Stephens wrote:
> What sort of performance hit is there for using S3 vs. a local cluster?
It probably only makes sense speed-wise if you're running on EC2. S3 access
from EC2 is a lot faster than accessing it from outside the Amazon cloud.
If you run on EC2, S3 is essentially your persistent storage.
--
Steve Sapovits
Invite Media - http://www.invitemedia.com
ssapovits@invitemedia.com
Re: Amazon S3 questions
Posted by Aaron Kimball <ak...@cs.washington.edu>.
Tests of S3-to-EC2 transfer speed using curl have shown that gigabit
Ethernet speeds can be reached between nodes, but only when multiple
transfers are overlapped. See:
http://info.rightscale.com/2007/11/29/network-performance-in-ec2-and-s3
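The overlapping itself can be sketched with a thread pool. This is only an illustration of the pattern; the fetch function and URLs below are hypothetical stand-ins for real S3 GETs:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Hypothetical stand-in for an HTTP GET of one S3 object
    # (e.g. via urllib or curl); here it just returns a size.
    return len(url)

# Hypothetical object URLs; bucket and key names are made up.
urls = ["https://mybucket.s3.amazonaws.com/part-%05d" % i for i in range(8)]

# Running several transfers concurrently keeps the pipe full;
# a single stream typically can't saturate the S3-EC2 link alone.
with ThreadPoolExecutor(max_workers=4) as pool:
    sizes = list(pool.map(fetch, urls))
```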
- Aaron
Toby DiPasquale wrote:
> On Sat, Mar 1, 2008 at 4:19 PM, Bradford Stephens
> <br...@gmail.com> wrote:
>> What sort of performance hit is there for using S3 vs. a local cluster?
>
> In my experience on EC2 using Hadoop, S3-backed HDFS is 25-33% slower
> than disk-backed HDFS. That was using 0.14.x, however, so it might be
> better now...
>
Re: Amazon S3 questions
Posted by Toby DiPasquale <co...@gmail.com>.
On Sat, Mar 1, 2008 at 4:19 PM, Bradford Stephens
<br...@gmail.com> wrote:
> What sort of performance hit is there for using S3 vs. a local cluster?
In my experience on EC2 using Hadoop, S3-backed HDFS is 25-33% slower
than disk-backed HDFS. That was using 0.14.x, however, so it might be
better now...
--
Toby DiPasquale
Re: Amazon S3 questions
Posted by Bradford Stephens <br...@gmail.com>.
What sort of performance hit is there for using S3 vs. a local cluster?
On Sat, Mar 1, 2008 at 1:09 PM, Steve Sapovits
<ss...@invitemedia.com> wrote:
>
> One other note: When you use S3 URIs, you get a "port out of range" error
> on startup but that doesn't appear to be fatal. I spent a few hours on that
> one before I realized it didn't seem to matter. It seems like the S3 URI format
> where ':' is used to separate ID and secret key is confusing someone.
>
>
>
> --
> Steve Sapovits
> Invite Media - http://www.invitemedia.com
> ssapovits@invitemedia.com
>
>
Re: Amazon S3 questions
Posted by Steve Sapovits <ss...@invitemedia.com>.
>> do you have an underscore in your bucket name?
>
> Yes I do.
Sorry, I was wrong -- no underscores. Some of our buckets do have them, but this particular one uses only dashes.
--
Steve Sapovits
Invite Media - http://www.invitemedia.com
ssapovits@invitemedia.com
Re: Amazon S3 questions
Posted by Steve Sapovits <ss...@invitemedia.com>.
Chris K Wensel wrote:
> do you have an underscore in your bucket name?
Yes I do.
Here's a sample error message/stack trace. This is version 0.16.0:
localhost: Exception in thread "main" java.lang.IllegalArgumentException: port out of range:-1
localhost: at java.net.InetSocketAddress.<init>(InetSocketAddress.java:118)
localhost: at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:125)
localhost: at org.apache.hadoop.dfs.SecondaryNameNode.<init>(SecondaryNameNode.java:94)
localhost: at org.apache.hadoop.dfs.SecondaryNameNode.main(SecondaryNameNode.java:492)
--
Steve Sapovits
Invite Media - http://www.invitemedia.com
ssapovits@invitemedia.com
Re: Amazon S3 questions
Posted by Chris K Wensel <ch...@wensel.net>.
do you have an underscore in your bucket name?
On Mar 2, 2008, at 9:24 AM, Tom White wrote:
>> One other note: When you use S3 URIs, you get a "port out of range"
>> error
>> on startup but that doesn't appear to be fatal. I spent a few
>> hours on that
>> one before I realized it didn't seem to matter. It seems like the
>> S3 URI format
>> where ':' is used to separate ID and secret key is confusing someone.
>
> Do you have a stacktrace for this? Sounds like something we could
> improve, if only by printing a warning message.
>
> Tom
Chris K Wensel
chris@wensel.net
http://chris.wensel.net/
Re: Amazon S3 questions
Posted by Tom White <to...@gmail.com>.
> One other note: When you use S3 URIs, you get a "port out of range" error
> on startup but that doesn't appear to be fatal. I spent a few hours on that
> one before I realized it didn't seem to matter. It seems like the S3 URI format
> where ':' is used to separate ID and secret key is confusing someone.
Do you have a stacktrace for this? Sounds like something we could
improve, if only by printing a warning message.
Tom
Re: Amazon S3 questions
Posted by Steve Sapovits <ss...@invitemedia.com>.
One other note: When you use S3 URIs, you get a "port out of range" error
on startup but that doesn't appear to be fatal. I spent a few hours on that
one before I realized it didn't seem to matter. It seems like the S3 URI format
where ':' is used to separate ID and secret key is confusing someone.
--
Steve Sapovits
Invite Media - http://www.invitemedia.com
ssapovits@invitemedia.com
Re: Amazon S3 questions
Posted by Steve Sapovits <ss...@invitemedia.com>.
Tom White wrote:
> From the client's point of view fs.default.name sets the default
> filesystem, and is used to resolve paths that don't specify a
> protocol. You can always use a fully qualified URI to specify the path
> e.g. s3://bucket/a/b or hdfs://nn/a/b. This allows you to, e.g.,
> take map inputs from HDFS and write reduce outputs to S3.
>
> For HDFS the setting of fs.default.name in hadoop-site.xml determines
> the host and port for the namenode.
>
> Does this help? How are you trying to use S3 by the way?
Yup -- I got that far. It looks like with S3 there's no real name node or data
node cluster -- S3's own distribution is used instead (more or less directly).
That's what my question was about. If that's the case, I like it. Does that make sense?
We will probably be running at least one version of a log writer/map-reducer
on EC2/S3. Basically, large volumes of data related to a specific type of
problem that we map-reduce for analysis. We've been playing with Pig on
top of map-reduce as well. Good stuff.
The only gotcha I see: we wrote (extended, really) a SWIG wrapper on top
of the C libhdfs library so we could interface to Python. It looks like the libhdfs
connect logic isn't handling the URI schemes 100% correctly -- I doubt S3 will
work through there. But that looks like an easy fix if that's the case (I think).
Testing that next ...
--
Steve Sapovits
Invite Media - http://www.invitemedia.com
ssapovits@invitemedia.com
Re: Amazon S3 questions
Posted by Tom White <to...@gmail.com>.
From the client's point of view fs.default.name sets the default
filesystem, and is used to resolve paths that don't specify a
protocol. You can always use a fully qualified URI to specify the path
e.g. s3://bucket/a/b or hdfs://nn/a/b. This allows you to, e.g.,
take map inputs from HDFS and write reduce outputs to S3.
For HDFS the setting of fs.default.name in hadoop-site.xml determines
the host and port for the namenode.
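As an illustrative hadoop-site.xml fragment (bucket name hypothetical), pointing the default filesystem at S3 rather than at a namenode:

```xml
<property>
  <name>fs.default.name</name>
  <value>s3://mybucket</value>
</property>
```

Note the S3 form carries no host:port pair at all, unlike an hdfs://host:port value.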
Does this help? How are you trying to use S3 by the way?
Tom
On 01/03/2008, Steve Sapovits <ss...@invitemedia.com> wrote:
>
> I have some confusion over the use of Amazon S3 as storage.
>
> I was looking at the fs.default.name as the name node -- a host and a port
> the client uses to ask the name node to perform DFS services. But for Amazon
> S3 you give it an S3 bucket URL, which is really just a direct pointer to the
> storage. So it seems fs.default.name is really just a storage setting that happens
> to be a service (host/port) for HDFS. Even though S3 is also a service, it can't
> also be a name node.
>
> If that's true, where is the host/port of the name node configured separate from
> fs.default.name? It would seem no matter what the underlying storage you'd need
> a name node service somewhere.
>
> I'm probably missing something bigger here ...
>
> --
> Steve Sapovits
> Invite Media - http://www.invitemedia.com
> ssapovits@invitemedia.com
>
>