Posted to common-commits@hadoop.apache.org by om...@apache.org on 2011/03/04 05:07:36 UTC

svn commit: r1077365 [3/5] - in /hadoop/common/branches/branch-0.20-security-patches/src/docs/src/documentation: ./ content/xdocs/ resources/images/

Modified: hadoop/common/branches/branch-0.20-security-patches/src/docs/src/documentation/content/xdocs/hdfs_user_guide.xml
URL: http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-security-patches/src/docs/src/documentation/content/xdocs/hdfs_user_guide.xml?rev=1077365&r1=1077364&r2=1077365&view=diff
==============================================================================
--- hadoop/common/branches/branch-0.20-security-patches/src/docs/src/documentation/content/xdocs/hdfs_user_guide.xml (original)
+++ hadoop/common/branches/branch-0.20-security-patches/src/docs/src/documentation/content/xdocs/hdfs_user_guide.xml Fri Mar  4 04:07:36 2011
@@ -1,10 +1,11 @@
 <?xml version="1.0"?>
 <!--
-  Copyright 2002-2004 The Apache Software Foundation
-
-  Licensed under the Apache License, Version 2.0 (the "License");
-  you may not use this file except in compliance with the License.
-  You may obtain a copy of the License at
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
 
       http://www.apache.org/licenses/LICENSE-2.0
 
@@ -23,7 +24,7 @@
 
   <header>
     <title>
-      HDFS User Guide
+      HDFS Users Guide
     </title>
   </header>
 
@@ -31,9 +32,8 @@
     <section> <title>Purpose</title>
       <p>
  This document is a starting point for users working with
- Hadoop Distributed File System (HDFS) either as a part of a
- <a href="http://hadoop.apache.org/">Hadoop</a>
- cluster or as a stand-alone general purpose distributed file system.
+ the Hadoop Distributed File System (HDFS) either as part of a Hadoop cluster
+ or as a stand-alone general-purpose distributed file system.
  While HDFS is designed to "just work" in many environments, a working
  knowledge of HDFS helps greatly with configuration improvements and
  diagnostics on a specific cluster.
@@ -45,7 +45,7 @@
  HDFS is the primary distributed storage used by Hadoop applications. A
  HDFS cluster primarily consists of a NameNode that manages the
  file system metadata and DataNodes that store the actual data. The
- <a href="hdfs_design.html">HDFS Architecture</a> describes HDFS in detail. This user guide primarily deals with 
+ <a href="hdfs_design.html">HDFS Architecture Guide</a> describes HDFS in detail. This user guide primarily deals with 
  the interaction of users and administrators with HDFS clusters. 
  The <a href="images/hdfsarchitecture.gif">HDFS architecture diagram</a> depicts 
  basic interactions among NameNode, the DataNodes, and the clients. 
@@ -60,8 +60,7 @@
     <li>
     	Hadoop, including HDFS, is well suited for distributed storage
     	and distributed processing using commodity hardware. It is fault
-    	tolerant, scalable, and extremely simple to expand.
-    	<a href="mapred_tutorial.html">Map/Reduce</a>,
+    	tolerant, scalable, and extremely simple to expand. MapReduce, 
    	well known for its simplicity and applicability to a large set of
     	distributed applications, is an integral part of Hadoop.
     </li>
@@ -113,26 +112,41 @@
     		problems.
     	</li>
     	<li>
-    		Secondary NameNode: performs periodic checkpoints of the 
+    		Secondary NameNode (deprecated): performs periodic checkpoints of the 
     		namespace and helps keep the size of file containing log of HDFS 
     		modifications within certain limits at the NameNode.
+    		Replaced by Checkpoint node.
+    	</li>
+    	<li>
+    		Checkpoint node: performs periodic checkpoints of the namespace and
+    		helps minimize the size of the log stored at the NameNode 
+    		containing changes to the HDFS.
+    		Replaces the role previously filled by the Secondary NameNode. 
+    		NameNode allows multiple Checkpoint nodes simultaneously, 
+    		as long as there are no Backup nodes registered with the system.
+    	</li>
+    	<li>
+    		Backup node: An extension to the Checkpoint node.
+    		In addition to checkpointing it also receives a stream of edits 
+    		from the NameNode and maintains its own in-memory copy of the namespace,
+    		which is always in sync with the active NameNode namespace state.
+    		Only one Backup node may be registered with the NameNode at once.
     	</li>
       </ul>
     </li>
     </ul>
     
-    </section> <section> <title> Pre-requisites </title>
+    </section> <section> <title> Prerequisites </title>
     <p>
- 	The following documents describe installation and set up of a
- 	Hadoop cluster : 
+ 	The following documents describe how to install and set up a Hadoop cluster: 
     </p>
  	<ul>
  	<li>
- 		<a href="quickstart.html">Hadoop Quick Start</a>
+ 		<a href="single_node_setup.html">Single Node Setup</a>
  		for first-time users.
  	</li>
  	<li>
- 		<a href="cluster_setup.html">Hadoop Cluster Setup</a>
+ 		<a href="cluster_setup.html">Cluster Setup</a>
  		for large, distributed clusters.
  	</li>
     </ul>
@@ -160,14 +174,15 @@
       Hadoop includes various shell-like commands that directly
       interact with HDFS and other file systems that Hadoop supports.
       The command
-      <code>bin/hadoop fs -help</code>
+      <code>bin/hdfs dfs -help</code>
       lists the commands supported by Hadoop
       shell. Furthermore, the command
-      <code>bin/hadoop fs -help command-name</code>
+      <code>bin/hdfs dfs -help command-name</code>
       displays more detailed help for a command. These commands support
-      most of the normal files ystem operations like copying files,
+      most of the normal file system operations like copying files,
       changing file permissions, etc. It also supports a few HDFS
-      specific operations like changing replication of files.
+      specific operations like changing replication of files. 
+      For more information see <a href="file_system_shell.html">File System Shell Guide</a>.
      </p>
 
    <section> <title> DFSAdmin Command </title>
@@ -203,18 +218,31 @@
      been marked for decommission. Entries not present in both lists 
       are decommissioned. 
     </li>
+    <li>
+      <code>-printTopology</code>
+      : Print the topology of the cluster. Display a tree of racks and
+      DataNodes attached to the racks, as viewed by the NameNode.
+    </li>
    	</ul>
    	<p>
-   	  For command usage, see <a href="commands_manual.html#dfsadmin">dfsadmin command</a>.
+   	  For command usage, see  
+   	  <a href="commands_manual.html#dfsadmin">dfsadmin</a>.
    	</p>  
    </section>
    
-   </section> <section> <title> Secondary NameNode </title>
-   <p>
+   </section> 
+	<section> <title>Secondary NameNode</title>
+   <note>
+   The Secondary NameNode has been deprecated. 
+   Instead, consider using the 
+   <a href="hdfs_user_guide.html#Checkpoint+Node">Checkpoint Node</a> or 
+   <a href="hdfs_user_guide.html#Backup+Node">Backup Node</a>.
+   </note>
+   <p>	
      The NameNode stores modifications to the file system as a log
-     appended to a native file system file (<code>edits</code>). 
+     appended to a native file system file, <code>edits</code>. 
    	When a NameNode starts up, it reads HDFS state from an image
-   	file (<code>fsimage</code>) and then applies edits from the
+   	file, <code>fsimage</code>, and then applies edits from the
     edits log file. It then writes new HDFS state to the <code>fsimage</code>
     and starts normal
    	operation with an empty edits file. Since NameNode merges
@@ -253,7 +281,122 @@
      read by the primary NameNode if necessary.
    </p>
    <p>
-     The latest checkpoint can be imported to the primary NameNode if
+     For command usage, see  
+     <a href="commands_manual.html#secondarynamenode">secondarynamenode</a>.
+   </p>
+   
+   </section><section> <title> Checkpoint Node </title>
+   <p>NameNode persists its namespace using two files: <code>fsimage</code>,
+      which is the latest checkpoint of the namespace and <code>edits</code>,
+      a journal (log) of changes to the namespace since the checkpoint.
+      When a NameNode starts up, it merges the <code>fsimage</code> and
+      <code>edits</code> journal to provide an up-to-date view of the
+      file system metadata.
+      The NameNode then overwrites <code>fsimage</code> with the new HDFS state 
+      and begins a new <code>edits</code> journal. 
+   </p>
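The fsimage/edits relationship described above can be sketched as a toy model. This is plain illustrative Python, not Hadoop code; all names and structures here are invented for the example:

```python
# Toy model of the NameNode startup sequence described above: persistent
# state is a checkpoint (fsimage) plus a journal of changes since that
# checkpoint (edits). Names and structures are illustrative, not Hadoop code.

def apply_edit(namespace, edit):
    """Apply one journal entry to the in-memory namespace."""
    op, path = edit
    if op == "create":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)

def startup(fsimage, edits):
    """Merge the checkpoint with the journal, as the NameNode does on start."""
    namespace = set(fsimage)
    for edit in edits:
        apply_edit(namespace, edit)
    return namespace

# After the merge, the NameNode writes the result as a new fsimage and
# begins a fresh, empty edits journal.
namespace = startup({"/a", "/b"}, [("create", "/c"), ("delete", "/a")])
```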
+   <p>
+     The Checkpoint node periodically creates checkpoints of the namespace. 
+     It downloads <code>fsimage</code> and <code>edits</code> from the active 
+     NameNode, merges them locally, and uploads the new image back to the 
+     active NameNode.
+     The Checkpoint node usually runs on a different machine than the NameNode
+     since its memory requirements are on the same order as those of the NameNode.
+     The Checkpoint node is started by 
+     <code>bin/hdfs namenode -checkpoint</code> on the node 
+     specified in the configuration file.
+   </p>
+   <p>The location of the Checkpoint (or Backup) node and its accompanying 
+      web interface are configured via the <code>dfs.backup.address</code> 
+      and <code>dfs.backup.http.address</code> configuration variables.
+	 </p>
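For illustration, the two variables might be set in the cluster configuration file like this (the host name and port values below are examples, not documented defaults):

```xml
<property>
  <name>dfs.backup.address</name>
  <value>checkpoint-host.example.com:50100</value>
</property>
<property>
  <name>dfs.backup.http.address</name>
  <value>checkpoint-host.example.com:50105</value>
</property>
```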
+   <p>
+     The start of the checkpoint process on the Checkpoint node is 
+     controlled by two configuration parameters.
+   </p>
+   <ul>
+      <li>
+        <code>fs.checkpoint.period</code>, set to 1 hour by default, specifies
+        the maximum delay between two consecutive checkpoints.
+      </li>
+      <li>
+        <code>fs.checkpoint.size</code>, set to 64MB by default, defines the
+        size of the edits log file that forces an urgent checkpoint even if 
+        the maximum checkpoint delay is not reached.
+      </li>
+   </ul>
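The two triggers above combine as a simple either/or condition, sketched here in toy Python (the constants mirror the documented defaults; the logic is illustrative, not Hadoop code):

```python
# Toy sketch of the two checkpoint triggers listed above. The constants
# mirror the documented defaults; the logic is illustrative only.

CHECKPOINT_PERIOD_SECS = 60 * 60           # fs.checkpoint.period: 1 hour
CHECKPOINT_SIZE_BYTES = 64 * 1024 * 1024   # fs.checkpoint.size: 64 MB

def should_checkpoint(secs_since_last, edits_log_bytes):
    """A checkpoint starts when either threshold is crossed."""
    return (secs_since_last >= CHECKPOINT_PERIOD_SECS
            or edits_log_bytes >= CHECKPOINT_SIZE_BYTES)
```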
+   <p>
+     The Checkpoint node stores the latest checkpoint in a  
+     directory that is structured the same as the NameNode's
+     directory. This allows the checkpointed image to be always available for
+     reading by the NameNode if necessary.
+     See <a href="hdfs_user_guide.html#Import+Checkpoint">Import Checkpoint</a>.
+   </p>
+   <p>Multiple checkpoint nodes may be specified in the cluster configuration file.</p>
+   <p>
+     For command usage, see  
+     <a href="commands_manual.html#namenode">namenode</a>.
+   </p>
+   </section>
+
+   <section> <title> Backup Node </title>
+   <p>	
+    The Backup node provides the same checkpointing functionality as the 
+    Checkpoint node, as well as maintaining an in-memory, up-to-date copy of the
+    file system namespace that is always synchronized with the active NameNode state.
+    Along with accepting a journal stream of file system edits from 
+    the NameNode and persisting this to disk, the Backup node also applies 
+    those edits into its own copy of the namespace in memory, thus creating 
+    a backup of the namespace.
+   </p>
+   <p>
+    The Backup node does not need to download 
+    <code>fsimage</code> and <code>edits</code> files from the active NameNode
+    in order to create a checkpoint, as would be required with a 
+    Checkpoint node or Secondary NameNode, since it already has an up-to-date 
+    state of the namespace in memory.
+    The Backup node checkpoint process is more efficient as it only needs to 
+    save the namespace into the local <code>fsimage</code> file and reset
+    <code>edits</code>.
+   </p> 
+   <p>
+    As the Backup node maintains a copy of the
+    namespace in memory, its RAM requirements are the same as those of the NameNode.
+   </p> 
+   <p>
+    The NameNode supports one Backup node at a time. No Checkpoint nodes may be
+    registered if a Backup node is in use. Using multiple Backup nodes 
+    concurrently will be supported in the future.
+   </p> 
+   <p>
+    The Backup node is configured in the same manner as the Checkpoint node.
+    It is started with <code>bin/hdfs namenode -checkpoint</code>.
+   </p>
+   <p>The location of the Backup (or Checkpoint) node and its accompanying 
+      web interface are configured via the <code>dfs.backup.address</code> 
+      and <code>dfs.backup.http.address</code> configuration variables.
+	 </p>
+   <p>
+    Use of a Backup node provides the option of running the NameNode with no 
+    persistent storage, delegating all responsibility for persisting the state
+    of the namespace to the Backup node. 
+    To do this, start the NameNode with the 
+    <code>-importCheckpoint</code> option, along with specifying no persistent
+    edits-type storage directories (<code>dfs.name.edits.dir</code>)
+    in the NameNode configuration.
+   </p> 
+   <p>
+    For a complete discussion of the motivation behind the creation of the 
+    Backup node and Checkpoint node, see 
+    <a href="https://issues.apache.org/jira/browse/HADOOP-4539">HADOOP-4539</a>.
+    For command usage, see  
+     <a href="commands_manual.html#namenode">namenode</a>.
+   </p>
+   </section>
+
+   <section> <title> Import Checkpoint </title>
+   <p>
+     The latest checkpoint can be imported to the NameNode if
      all other copies of the image and the edits files are lost.
      In order to do that one should:
    </p>
@@ -280,10 +423,12 @@
      consistent, but does not modify it in any way.
    </p>
    <p>
-     For command usage, see <a href="commands_manual.html#secondarynamenode"><code>secondarynamenode</code> command</a>.
+     For command usage, see  
+      <a href="commands_manual.html#namenode">namenode</a>.
    </p>
-   
-   </section> <section> <title> Rebalancer </title>
+   </section>
+
+   <section> <title> Rebalancer </title>
     <p>
      HDFS data might not always be placed uniformly across the
      DataNodes. One common reason is addition of new DataNodes to an
@@ -321,7 +466,8 @@
       <a href="http://issues.apache.org/jira/browse/HADOOP-1652">HADOOP-1652</a>.
     </p>
     <p>
-     For command usage, see <a href="commands_manual.html#balancer">balancer command</a>.
+     For command usage, see  
+     <a href="commands_manual.html#balancer">balancer</a>.
    </p>
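The rebalancing idea above can be sketched as a toy model: find DataNodes whose disk utilization deviates from the cluster average by more than a threshold. This is a deliberate simplification in plain Python, not the real balancer algorithm:

```python
# Toy sketch of the rebalancing idea: identify DataNodes whose disk
# utilization sits above or below the cluster average by more than a
# threshold. A simplification for illustration, not the real balancer.

def classify(utilizations, threshold=0.1):
    """Split nodes into over- and under-utilized sets around the mean."""
    avg = sum(utilizations.values()) / len(utilizations)
    over = {n for n, u in utilizations.items() if u > avg + threshold}
    under = {n for n, u in utilizations.items() if u < avg - threshold}
    return over, under

# Blocks would then move from the over-utilized set toward the
# under-utilized set, subject to replica-placement constraints.
over, under = classify({"dn1": 0.90, "dn2": 0.50, "dn3": 0.10})
```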
     
    </section> <section> <title> Rack Awareness </title>
@@ -357,7 +503,7 @@
       using <code>'bin/hadoop dfsadmin -safemode'</code> command. NameNode front
       page shows whether Safemode is on or off. A more detailed
       description and configuration is maintained as JavaDoc for
-      <a href="http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/dfs/NameNode.html#setSafeMode(org.apache.hadoop.dfs.FSConstants.SafeModeAction)"><code>setSafeMode()</code></a>.
+      <code>NameNode.setSafeMode()</code>.
     </p>
     
    </section> <section> <title> fsck </title>
@@ -372,7 +518,8 @@
       <code>fsck</code> ignores open files but provides an option to select all files during reporting.
       The HDFS <code>fsck</code> command is not a
       Hadoop shell command. It can be run as '<code>bin/hadoop fsck</code>'.
-      For command usage, see <a href="commands_manual.html#fsck"><code>fsck</code> command</a>. 
+      For command usage, see  
+      <a href="commands_manual.html#fsck">fsck</a>.
       <code>fsck</code> can be run on the whole file system or on a subset of files.
      </p>
    
@@ -403,7 +550,7 @@
       of Hadoop and rollback the cluster to the state it was in 
       before
       the upgrade. HDFS upgrade is described in more detail in 
-      <a href="http://wiki.apache.org/hadoop/Hadoop%20Upgrade">upgrade wiki</a>.
+      the <a href="http://wiki.apache.org/hadoop/Hadoop_Upgrade">Hadoop Upgrade</a> Wiki page.
       HDFS can have one such backup at a time. Before upgrading,
       administrators need to remove existing backup using <code>bin/hadoop
       dfsadmin -finalizeUpgrade</code> command. The following
@@ -447,13 +594,13 @@
       treated as the superuser for HDFS. Future versions of HDFS will
       support network authentication protocols like Kerberos for user
       authentication and encryption of data transfers. The details are discussed in the 
-      <a href="hdfs_permissions_guide.html">HDFS Admin Guide: Permissions</a>.
+      <a href="hdfs_permissions_guide.html">Permissions Guide</a>.
      </p>
      
    </section> <section> <title> Scalability </title>
      <p>
-      Hadoop currently runs on clusters with thousands of nodes.
-      <a href="http://wiki.apache.org/hadoop/PoweredBy">Powered By Hadoop</a>
+      Hadoop currently runs on clusters with thousands of nodes. The  
+      <a href="http://wiki.apache.org/hadoop/PoweredBy">PoweredBy</a> Wiki page 
       lists some of the organizations that deploy Hadoop on large
       clusters. HDFS has one NameNode for each cluster. Currently
       the total memory available on NameNode is the primary scalability
@@ -461,8 +608,8 @@
       files stored in HDFS helps with increasing cluster size without
       increasing memory requirements on NameNode.
    
-      The default configuration may not suite very large clustes.
-      <a href="http://wiki.apache.org/hadoop/FAQ">Hadoop FAQ</a> page lists
+      The default configuration may not suit very large clusters. The 
+      <a href="http://wiki.apache.org/hadoop/FAQ">FAQ</a> Wiki page lists
       suggested configuration improvements for large Hadoop clusters.
      </p>
      
@@ -475,18 +622,20 @@
       </p>
       <ul>
       <li>
-        <a href="http://hadoop.apache.org/">Hadoop Home Page</a>: The start page for everything Hadoop.
+        <a href="http://hadoop.apache.org/">Hadoop Site</a>: The home page for the Apache Hadoop site.
       </li>
       <li>
-        <a href="http://wiki.apache.org/hadoop/FrontPage">Hadoop Wiki</a>
-        : Front page for Hadoop Wiki documentation. Unlike this
-        guide which is part of Hadoop source tree, Hadoop Wiki is
+        <a href="http://wiki.apache.org/hadoop/FrontPage">Hadoop Wiki</a>:
+        The home page (FrontPage) for the Hadoop Wiki. Unlike the released documentation, 
+        which is part of the Hadoop source tree, the Hadoop Wiki is
        regularly edited by the Hadoop Community.
       </li>
-      <li> <a href="http://wiki.apache.org/hadoop/FAQ">FAQ</a> from Hadoop Wiki.
+      <li> <a href="http://wiki.apache.org/hadoop/FAQ">FAQ</a>: 
+      The FAQ Wiki page.
       </li>
       <li>
-        Hadoop <a href="ext:api">JavaDoc API</a>.
+        Hadoop <a href="http://hadoop.apache.org/core/docs/current/api/">
+          JavaDoc API</a>.
       </li>
       <li>
         Hadoop User Mailing List : 
@@ -498,7 +647,7 @@
          description of most of the configuration variables available.
       </li>
       <li>
-        <a href="commands_manual.html">Hadoop Command Guide</a>: commands usage.
+        <a href="commands_manual.html">Hadoop Commands Guide</a>: Hadoop commands usage.
       </li>
       </ul>
      </section>