Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2007/01/07 23:09:21 UTC
[Lucene-hadoop Wiki] Update of "AmazonS3" by TomWhite
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.
The following page has been changed by TomWhite:
http://wiki.apache.org/lucene-hadoop/AmazonS3
New page:
[http://aws.amazon.com/s3 Amazon S3] (Simple Storage Service) is a data storage service. You are billed
monthly for storage and data transfer. Transfer between S3 and Self:AmazonEC2 is free, which makes
S3 attractive for Hadoop users who run clusters on EC2.
There are two ways S3 can be used with Hadoop's Map/Reduce: either as a replacement for HDFS
(i.e. as a reliable distributed filesystem with support for very large files),
or as a convenient repository for data input to and output from Map/Reduce. In the second case
HDFS is still used for the Map/Reduce phase.
S3 support was introduced in Hadoop 0.10 ([http://issues.apache.org/jira/browse/HADOOP-574 HADOOP-574]),
but it needs the patch in [http://issues.apache.org/jira/browse/HADOOP-857 HADOOP-857] to work properly.
The patch in [https://issues.apache.org/jira/browse/HADOOP-862 HADOOP-862] makes S3 work with the
Hadoop CopyFile tool.
(Hopefully these patches will be integrated in the next release.)
= Setting up Hadoop to use S3 as a replacement for HDFS =
Put the following in ''conf/hadoop-site.xml'' to set the default filesystem to be S3:
{{{
<property>
  <name>fs.default.name</name>
  <value>s3://BUCKET</value>
</property>

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>ID</value>
</property>

<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>SECRET</value>
</property>
}}}
Alternatively, you can put the access key ID and the secret access key into the S3 URI as the user info:
{{{
<property>
  <name>fs.default.name</name>
  <value>s3://ID:SECRET@BUCKET</value>
</property>
}}}
Note that since the secret access key can contain slashes, you must escape each slash `/` by
replacing it with the string `%2F`. Keys specified in the URI take precedence over any set with
the properties `fs.s3.awsAccessKeyId` and `fs.s3.awsSecretAccessKey`.
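The escaping step above can be done mechanically. A minimal sketch, using a made-up key (not a real credential) and standard `sed`:

```shell
# Hypothetical secret access key containing a slash -- placeholder only
SECRET='abc/def+ghi'

# Replace each '/' with '%2F' so the key can be embedded in an s3:// URI
ESCAPED=$(printf '%s' "$SECRET" | sed 's|/|%2F|g')

echo "$ESCAPED"   # abc%2Fdef+ghi
```

The escaped value would then appear in the URI as `s3://ID:abc%2Fdef+ghi@BUCKET`.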
Running the Map/Reduce demo in the [http://lucene.apache.org/hadoop/api/index.html Hadoop API Documentation] using
S3 is now a matter of running:
{{{
mkdir input
cp conf/*.xml input
bin/hadoop fs -put input input
bin/hadoop fs -ls input
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
bin/hadoop fs -get output output
cat output/*
}}}
To run in distributed mode you need only run a JobTracker; the HDFS NameNode is unnecessary.
= Setting up Hadoop to use S3 as a repository for data input to and output from Map/Reduce =
The idea here is to put your input on S3 and transfer it to HDFS using
the `bin/hadoop distcp` tool. Once the Map/Reduce job is complete, the output is copied back to S3,
either as input to a further job or for retrieval as a final result.
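That workflow might look something like the following sketch; the bucket name and paths are placeholders, and the job is the grep example shown earlier (running this requires a live cluster and S3 credentials):

```shell
# Copy job input from S3 into HDFS (BUCKET is a placeholder bucket name)
bin/hadoop distcp s3://BUCKET/input input

# Run the Map/Reduce job against HDFS as usual
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

# Copy the results back to S3, ready for a further job or final retrieval
bin/hadoop distcp output s3://BUCKET/output
```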
[More instruction will be added after [https://issues.apache.org/jira/browse/HADOOP-862 HADOOP-862] is complete.]