Posted to mapreduce-commits@hadoop.apache.org by sz...@apache.org on 2010/04/01 22:46:07 UTC
svn commit: r930088 - in /hadoop/mapreduce/trunk: CHANGES.txt
src/docs/src/documentation/content/xdocs/hadoop_archives.xml
Author: szetszwo
Date: Thu Apr 1 20:46:06 2010
New Revision: 930088
URL: http://svn.apache.org/viewvc?rev=930088&view=rev
Log:
MAPREDUCE-1514. Add documentation on replication, permissions, new options, limitations and internals of har. Contributed by mahadev
Modified:
hadoop/mapreduce/trunk/CHANGES.txt
hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/hadoop_archives.xml
Modified: hadoop/mapreduce/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/CHANGES.txt?rev=930088&r1=930087&r2=930088&view=diff
==============================================================================
--- hadoop/mapreduce/trunk/CHANGES.txt (original)
+++ hadoop/mapreduce/trunk/CHANGES.txt Thu Apr 1 20:46:06 2010
@@ -235,6 +235,9 @@ Trunk (unreleased changes)
MAPREDUCE-1489. DataDrivenDBInputFormat should not query the database
when generating only one split. (Aaron Kimball via tomwhite)
+ MAPREDUCE-1514. Add documentation on replication, permissions, new options,
+ limitations and internals of har. (mahadev via szetszwo)
+
OPTIMIZATIONS
MAPREDUCE-270. Fix the tasktracker to optionally send an out-of-band
Modified: hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/hadoop_archives.xml
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/hadoop_archives.xml?rev=930088&r1=930087&r2=930088&view=diff
==============================================================================
--- hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/hadoop_archives.xml (original)
+++ hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/hadoop_archives.xml Thu Apr 1 20:46:06 2010
@@ -17,38 +17,40 @@
-->
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
<document>
- <header>
- <title>Archives Guide</title>
- </header>
- <body>
- <section>
- <title> What are Hadoop archives? </title>
- <p>
- Hadoop archives are special format archives. A Hadoop archive
- maps to a file system directory. A Hadoop archive always has a *.har
- extension. A Hadoop archive directory contains metadata (in the form
- of _index and _masterindex) and data (part-*) files. The _index file contains
- the name of the files that are part of the archive and the location
- within the part files.
- </p>
- </section>
-
- <section>
- <title> How to create an archive? </title>
- <p>
- <code>Usage: hadoop archive -archiveName name -p <parent> <src>* <dest></code>
+ <header>
+ <title>Hadoop Archive Guide</title>
+ </header>
+ <body>
+ <section>
+ <title> What are Hadoop archives?</title>
+      <p> Hadoop archives are special format archives. The main use case for
+      archives is to reduce the number of files in the NameNode's namespace:
+      a Hadoop archive collapses a set of files into a smaller number of
+      files while providing an efficient and easy interface to access the
+      original files. A Hadoop archive maps to an HDFS directory. A Hadoop
+      archive always has a *.har extension.</p>
+ </section>
+ <section>
+ <title> How to create an archive?</title>
+ <p>
+ <code>Usage: hadoop archive -archiveName name -p <parent> <src>* <dest></code>
</p>
<p>
-archiveName is the name of the archive you would like to create.
An example would be foo.har. The name should have a *.har extension.
- The parent argument is to specify the relative path to which the files should be
+ The parent argument specifies the path relative to which the files
+ should be
archived. An example would be:
</p><p><code> -p /foo/bar a/b/c e/f/g </code></p><p>
- Here /foo/bar is the parent path and a/b/c, e/f/g are relative paths to parent.
+ Here /foo/bar is the parent path and a/b/c, e/f/g are relative paths to
+ the parent.
Note that this is a Map/Reduce job that creates the archives. You would
- need a map reduce cluster to run this. For a detailed example the later sections. </p>
- <p> If you just want to archive a single directory /foo/bar then you can just use </p>
- <p><code> hadoop archive -archiveName zoo.har -p /foo/bar /outputdir </code></p>
+ need a Map/Reduce cluster to run this. For a detailed example, see the
+ later sections. </p>
+ <p> If you just want to archive a single directory /foo/bar, then you
+ can just use </p>
+ <p><code> hadoop archive -archiveName zoo.har -p /foo/bar /outputdir
+ </code></p>
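+ <p> Assuming the command above completes successfully, you can verify
+ the result by listing the archive through the har filesystem: </p>
+ <p><code> hadoop dfs -lsr har:///outputdir/zoo.har </code></p>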
</section>
<section>
@@ -58,7 +60,8 @@
commands in the archives work but with a different URI. Also, note that
archives are immutable. So renames, deletes and creates return
an error. The URI for Hadoop Archives is
- </p><p><code>har://scheme-hostname:port/archivepath/fileinarchive</code></p><p>
+ </p><p><code>har://scheme-hostname:port/archivepath/fileinarchive</code>
+ </p><p>
If no scheme is provided, the underlying filesystem is assumed.
In that case the URI would look like </p>
<p><code>har:///archivepath/fileinarchive</code></p>
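+ <p> For example, for an archive stored in HDFS on a NameNode host
+ namenode listening on port 8020 (the host, port and file name here are
+ illustrative), a file inside the archive could be addressed as: </p>
+ <!-- illustrative host, port and file name -->
+ <p><code>har://hdfs-namenode:8020/user/zoo/foo.har/dir1/file1</code></p>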
@@ -66,22 +69,27 @@
<section>
<title> Example on creating and looking up archives </title>
- <p><code>hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo </code></p>
+ <p><code>hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2
+ /user/zoo </code></p>
<p>
- The above example is creating an archive using /user/hadoop as the relative archive directory.
+ The above example creates an archive using /user/hadoop as the
+ relative archive directory.
The directories /user/hadoop/dir1 and /user/hadoop/dir2 will be
- archived in the following file system directory -- /user/zoo/foo.har. Archiving does not delete the input
- files. If you want to delete the input files after creating the archives (to reduce namespace), you
+ archived in the following file system directory -- /user/zoo/foo.har.
+ Archiving does not delete the input files. If you want to delete
+ the input files after creating the archives (to reduce namespace), you
will have to do it on your own.
</p>
<section>
<title> Looking up files and understanding the -p option </title>
- <p> Looking up files in hadoop archives is as easy as doing an ls on the filesystem. After you have
- archived the directories /user/hadoop/dir1 and /user/hadoop/dir2 as in the exmaple above, to see all
- the files in the archives you can just run: </p>
+ <p> Looking up files in hadoop archives is as easy as doing an ls on the
+ filesystem. After you have archived the directories /user/hadoop/dir1 and
+ /user/hadoop/dir2 as in the example above, to see all the files in the
+ archives you can just run: </p>
<p><code>hadoop dfs -lsr har:///user/zoo/foo.har/</code></p>
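+ <p> Assuming dir1 and dir2 contain files a and b respectively (the
+ file names here are illustrative), the output would look something
+ like: </p>
+ <source>
+ har:///user/zoo/foo.har/dir1
+ har:///user/zoo/foo.har/dir1/a
+ har:///user/zoo/foo.har/dir2
+ har:///user/zoo/foo.har/dir2/b
+ </source>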
- <p> To understand the significance of the -p argument, lets go through the above example again. If you just do
+ <p> To understand the significance of the -p argument, let's go through the
+ above example again. If you just do
an ls (not lsr) on the hadoop archive using </p>
<p><code>hadoop dfs -ls har:///user/zoo/foo.har</code></p>
<p>The output should be:</p>
@@ -90,9 +98,11 @@ har:///user/zoo/foo.har/dir1
har:///user/zoo/foo.har/dir2
</source>
<p> As you may recall, the archives were created with the following command: </p>
- <p><code>hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo </code></p>
+ <p><code>hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2
+ /user/zoo </code></p>
<p> If we were to change the command to: </p>
- <p><code>hadoop archive -archiveName foo.har -p /user/ hadoop/dir1 hadoop/dir2 /user/zoo </code></p>
+ <p><code>hadoop archive -archiveName foo.har -p /user/
+ hadoop/dir1 hadoop/dir2 /user/zoo </code></p>
<p> then an ls on the hadoop archive using </p>
<p><code>hadoop dfs -ls har:///user/zoo/foo.har</code></p>
<p>would give you</p>
@@ -101,17 +111,87 @@ har:///user/zoo/foo.har/hadoop/dir1
har:///user/zoo/foo.har/hadoop/dir2
</source>
<p>
- Notice that the archived files have been archived relative to /user/ rather than /user/hadoop.
+ Notice that the archived files have been archived relative to
+ /user/ rather than /user/hadoop.
</p>
</section>
</section>
<section>
- <title> Using Hadoop Archives with Map Reduce </title>
- <p>Using Hadoop Archives in Map Reduce is as easy as specifying a different input filesystem than the default file system.
- If you have a hadoop archive stored in HDFS in /user/zoo/foo.har then for using this archive for Map Reduce input, all
- you need to specify the input directory as har:///user/zoo/foo.har. Since Hadoop Archives is exposed as a file system
- Map Reduce will be able to use all the logical input files in Hadoop Archives as input.</p>
+ <title> Using Hadoop Archive with Map Reduce </title>
+ <p>Using a Hadoop Archive in Map Reduce is as easy as specifying a
+ different input filesystem than the default file system.
+ If you have a hadoop archive stored in HDFS in /user/zoo/foo.har,
+ then to use this archive as Map Reduce input, all
+ you need to do is specify the input directory as har:///user/zoo/foo.har.
+ Since a Hadoop Archive is exposed as a file system, Map Reduce will be
+ able to use all the logical input files in the archive as input.</p>
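+ <p> For example, to run the wordcount program from the examples jar
+ shipped with Hadoop over one of the archived directories (the jar name
+ and output path here are illustrative): </p>
+ <p><code> bin/hadoop jar hadoop-*-examples.jar wordcount
+ har:///user/zoo/foo.har/dir1 /user/zoo/wordcount_out </code></p>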
+ </section>
+
+ <section>
+ <title> File Replication and Permissions of Hadoop Archive </title>
+ <p>
+ Hadoop Archive currently does not store the metadata (permissions,
+ replication and so on) that the files had before they were archived.
+ The files in a newly created Hadoop Archive get the default permissions
+ with which the user creates files. The replication factor of the data
+ files in a Hadoop Archive is set to 3, and the metadata files have a
+ replication factor of 5. You can increase these by increasing the
+ replication factor of the files under the har directory.
+ If you restore the archived files using something like:
+ </p>
+ <p><code>
+ hadoop distcp har:///path_to_har dest_path
+ </code></p>
+ <p>
+ the restored files will not retain the permissions or replication
+ factors of the original files that were archived.
+ </p>
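+ <p> For example, to raise the replication factor of all the files under
+ an existing archive (the factor 10 here is only illustrative), you can
+ use the standard setrep command: </p>
+ <p><code> hadoop dfs -setrep -R 10 /user/zoo/foo.har </code></p>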
+ </section>
+
+ <section>
+ <title> Creating a Hadoop Archive with different block size and part size </title>
+ <p>
+ You can specify a different block size and part file size when creating
+ an archive using the following options:
+ </p> <p><code>
+ bin/hadoop archive -Dhar.block.size=512 -Dhar.partfile.size=1024 -archiveName ...
+ </code></p>
+ <p>
+ The above example sets a block size of 512 bytes for the part files and
+ a part file size of 1 KB. These numbers are only examples; using such
+ small values for the block size and part file size is not advisable!
+ </p>
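+ <p> A more realistic invocation (the sizes below are illustrative, not
+ recommendations) would use values on the order of typical HDFS block
+ sizes, for example a 512 MB block size and a 4 GB part file size: </p>
+ <p><code> bin/hadoop archive -Dhar.block.size=536870912
+ -Dhar.partfile.size=4294967296 -archiveName foo.har -p /user/hadoop
+ dir1 dir2 /user/zoo </code></p>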
</section>
+
+ <section>
+ <title> Limitations of Hadoop Archive </title>
+ <p>
+ Currently Hadoop Archive does not support input paths with spaces in
+ them; it throws an exception in such a case. Instead, you can create
+ archives in which each space is replaced by a valid character.
+ Below is an example:
+ </p>
+ <p>
+ <code>
+ bin/hadoop archive -Dhar.space.replacement.enable=true -Dhar.space.replacement="_"
+ -archiveName ......
+ </code>
+ </p>
+ <p> The above example replaces each space with "_" in the archived file names.
+ </p>
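+ <p> Reusing the single-directory example from above, a complete
+ invocation (the archive name and paths are illustrative) might look
+ like: </p>
+ <!-- illustrative archive name and paths -->
+ <p><code> bin/hadoop archive -Dhar.space.replacement.enable=true
+ -Dhar.space.replacement="_" -archiveName zoo.har -p /foo/bar /outputdir
+ </code></p>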
+ </section>
+
+ <section>
+ <title>Internals of Hadoop Archive </title>
+ <p>
+ A Hadoop Archive directory contains metadata (in the form
+ of _index and _masterindex) and data (part-*) files. The _index file contains
+ the names of the files that are part of the archive and their locations
+ within the part files. The _masterindex file stores offsets into the _index
+ file to make it easier to seek into the _index file for faster lookups.
+ </p>
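+ <p> For example, a plain listing of the archive directory on the
+ underlying filesystem (rather than through the har scheme) for the
+ foo.har archive created earlier would show the index and part files
+ directly (the exact number of part files may vary): </p>
+ <p><code> hadoop dfs -ls /user/zoo/foo.har </code></p>
+ <source>
+ /user/zoo/foo.har/_index
+ /user/zoo/foo.har/_masterindex
+ /user/zoo/foo.har/part-0
+ </source>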
+ </section>
+
</body>
</document>