Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2007/01/07 23:09:21 UTC

[Lucene-hadoop Wiki] Update of "AmazonS3" by TomWhite

The following page has been changed by TomWhite:
http://wiki.apache.org/lucene-hadoop/AmazonS3

New page:
[http://aws.amazon.com/s3 Amazon S3] (Simple Storage Service) is a data storage service. You are billed
monthly for storage and data transfer. Transfer between S3 and Self:AmazonEC2 is free, which makes
S3 attractive for Hadoop users who run their clusters on EC2.

There are two ways that S3 can be used with Hadoop's Map/Reduce: either as a replacement for HDFS
(i.e. as a reliable distributed filesystem with support for very large files),
or as a convenient repository for data input to and output from Map/Reduce. In the second case
HDFS is still used for the Map/Reduce phase.

S3 support was introduced in Hadoop 0.10 ([http://issues.apache.org/jira/browse/HADOOP-574 HADOOP-574]),
but it needs the patch in [http://issues.apache.org/jira/browse/HADOOP-857 HADOOP-857] to work properly.
The patch in [https://issues.apache.org/jira/browse/HADOOP-862 HADOOP-862] makes S3 work with
Hadoop's copy tool, `bin/hadoop distcp` (see below).
(Hopefully these patches will be integrated in the next release.)

= Setting up Hadoop to use S3 as a replacement for HDFS =

Put the following in ''conf/hadoop-site.xml'' to set the default filesystem to be S3:

{{{
<property>
  <name>fs.default.name</name>
  <value>s3://BUCKET</value>
</property>

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>ID</value>
</property>

<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>SECRET</value>
</property>
}}}
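
With this configuration in place, a quick sanity check is to list the root of the bucket with the
Hadoop filesystem shell (assuming the bucket already exists and the credentials are valid; an empty
listing with no error means the S3 filesystem is wired up correctly):

{{{
# List the root of the S3 bucket through the Hadoop filesystem shell.
bin/hadoop fs -ls /
}}}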

Alternatively, you can put the access key ID and the secret access key into the S3 URI as the user info:

{{{
<property>
  <name>fs.default.name</name>
  <value>s3://ID:SECRET@BUCKET</value>
</property>
}}}

Note that since the secret access key can contain slashes, you must escape each slash `/` in it
with the string `%2F` before embedding it in the URI.
Keys specified in the URI take precedence over any specified using the properties `fs.s3.awsAccessKeyId` and
`fs.s3.awsSecretAccessKey`.
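
For example, a secret key that contains slashes can be escaped on the command line before being
pasted into the URI (a quick sketch using standard shell tools; the key shown is made up):

{{{
# Made-up secret key containing slashes (replace with your own).
SECRET='ab/cd/ef'
# Replace every "/" with "%2F" so the key can be embedded in the S3 URI.
ESCAPED=$(echo "$SECRET" | sed 's,/,%2F,g')
echo "s3://ID:${ESCAPED}@BUCKET"   # prints s3://ID:ab%2Fcd%2Fef@BUCKET
}}}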

Running the Map/Reduce demo in the [http://lucene.apache.org/hadoop/api/index.html Hadoop API Documentation] using
S3 is now a matter of running:

{{{
mkdir input
cp conf/*.xml input
bin/hadoop fs -put input input
bin/hadoop fs -ls input
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
bin/hadoop fs -get output output
cat output/*
}}}

To run in distributed mode you only need to run a JobTracker (plus TaskTrackers on the worker nodes); the HDFS NameNode is unnecessary.
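
For example, the daemons can be started individually with the standard daemon script (a sketch
assuming a stock Hadoop installation; run the TaskTracker command on each worker node):

{{{
# Start the JobTracker on the master node (no NameNode or DataNodes needed when S3 replaces HDFS).
bin/hadoop-daemon.sh start jobtracker

# Start a TaskTracker on each worker node.
bin/hadoop-daemon.sh start tasktracker
}}}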

= Setting up Hadoop to use S3 as a repository for data input to and output from Map/Reduce =

The idea here is to put your input on S3, then transfer it to HDFS using
the `bin/hadoop distcp` tool. Once the Map/Reduce job is complete, the output is copied back to S3,
either as input to a further job or to be retrieved as a final result.

[More instructions will be added after [https://issues.apache.org/jira/browse/HADOOP-862 HADOOP-862] is complete.]
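
In the meantime, the copy steps will probably look roughly like the following once S3 support is in
`distcp` (a hedged sketch; the exact syntax, host name, port and paths below are assumptions):

{{{
# Copy input data from S3 into HDFS before running the job
# (NAMENODE_HOST, the port and the paths are placeholders).
bin/hadoop distcp s3://ID:SECRET@BUCKET/input hdfs://NAMENODE_HOST:9000/input

# ... run the Map/Reduce job against the input in HDFS ...

# Copy the job output back to S3.
bin/hadoop distcp hdfs://NAMENODE_HOST:9000/output s3://ID:SECRET@BUCKET/output
}}}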