Posted to mapreduce-commits@hadoop.apache.org by cd...@apache.org on 2009/11/03 09:02:25 UTC

svn commit: r832330 - in /hadoop/mapreduce/branches/branch-0.21: CHANGES.txt src/docs/src/documentation/content/xdocs/distcp.xml

Author: cdouglas
Date: Tue Nov  3 08:02:24 2009
New Revision: 832330

URL: http://svn.apache.org/viewvc?rev=832330&view=rev
Log:
MAPREDUCE-971. Document use of distcp when copying to s3, managing timeouts
in particular. Contributed by Aaron Kimball

Modified:
    hadoop/mapreduce/branches/branch-0.21/CHANGES.txt
    hadoop/mapreduce/branches/branch-0.21/src/docs/src/documentation/content/xdocs/distcp.xml

Modified: hadoop/mapreduce/branches/branch-0.21/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/branch-0.21/CHANGES.txt?rev=832330&r1=832329&r2=832330&view=diff
==============================================================================
--- hadoop/mapreduce/branches/branch-0.21/CHANGES.txt (original)
+++ hadoop/mapreduce/branches/branch-0.21/CHANGES.txt Tue Nov  3 08:02:24 2009
@@ -422,6 +422,9 @@
     MAPREDUCE-1012. Mark Context interfaces as public evolving. (Tom White via
     cdouglas)
 
+    MAPREDUCE-971. Document use of distcp when copying to s3, managing timeouts
+    in particular. (Aaron Kimball via cdouglas)
+
   BUG FIXES
 
     MAPREDUCE-1089. Fix NPE in fair scheduler preemption when tasks are  

Modified: hadoop/mapreduce/branches/branch-0.21/src/docs/src/documentation/content/xdocs/distcp.xml
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/branch-0.21/src/docs/src/documentation/content/xdocs/distcp.xml?rev=832330&r1=832329&r2=832330&view=diff
==============================================================================
--- hadoop/mapreduce/branches/branch-0.21/src/docs/src/documentation/content/xdocs/distcp.xml (original)
+++ hadoop/mapreduce/branches/branch-0.21/src/docs/src/documentation/content/xdocs/distcp.xml Tue Nov  3 08:02:24 2009
@@ -317,6 +317,36 @@
       </section>
 
       <section>
+        <title>Copying to S3</title>
+
+        <p>DistCp can be used to copy data between HDFS and other filesystems,
+        including those backed by S3. The <code>s3n</code> FileSystem
+        implementation allows DistCp (and Hadoop in general) to use an S3
+        bucket as a source or target for transfers. To transfer data from
+        HDFS to an S3 bucket, invoke DistCp using arguments like the following:
+        </p>
+<source>
+bash$ hadoop distcp hdfs://nn:8020/foo/bar \
+    s3n://$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY@&lt;bucket&gt;/foo/bar
+</source>
+
+        <p><code>$AWS_ACCESS_KEY_ID</code> and
+        <code>$AWS_SECRET_ACCESS_KEY</code> are environment variables holding
+        S3 access credentials.</p>
+
+        <p>Some FileSystem operations take longer on S3 than on HDFS. If you
+        are transferring large files to S3 (e.g., 1 GB and up), you may
+        experience timeouts during your job. To prevent this, you should set
+        the task timeout to a larger interval than is typically used:
+        </p>
+<source>
+bash$ hadoop distcp -D mapred.task.timeout=1800000 \
+    hdfs://nn:8020/foo/bar \
+    s3n://$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY@&lt;bucket&gt;/foo/bar
+</source>
+      </section>
+
+      <section>
         <title>MapReduce and Other Side-effects</title>
 
         <p>As has been mentioned in the preceding, should a map fail to copy